[00:05:26] 10Operations, 10monitoring: Icinga check for ipv6 host reachability - https://phabricator.wikimedia.org/T163996#4144805 (10Dzahn) https://exchange.nagios.org/directory/Patches/Nagios-Core/IPv6-address-in-host-definition-patch/details [00:12:42] (03PS1) 10Dzahn: icinga: enable paging and set contact_group for grid engine checks [puppet] - 10https://gerrit.wikimedia.org/r/427833 (https://phabricator.wikimedia.org/T177850) [00:32:23] PROBLEM - Disk space on labtestnet2001 is CRITICAL: DISK CRITICAL - free space: / 332 MB (3% inode=75%) [01:44:10] (03PS4) 10Krinkle: Remove obsolete $wgCentralPagePath CentralNotice global [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416618 (owner: 10AndyRussG) [01:44:59] (03CR) 10Krinkle: [C: 032] "Unused. Confirmed via code search and github @wikimedia." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416618 (owner: 10AndyRussG) [01:46:21] (03Merged) 10jenkins-bot: Remove obsolete $wgCentralPagePath CentralNotice global [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416618 (owner: 10AndyRussG) [01:46:58] Staging on mwdebug1002 [01:49:09] (03CR) 10jenkins-bot: Remove obsolete $wgCentralPagePath CentralNotice global [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416618 (owner: 10AndyRussG) [01:50:29] !log krinkle@tin Synchronized wmf-config/CommonSettings.php: If8fdce707d (duration: 01m 17s) [01:50:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:15:32] PROBLEM - High CPU load on API appserver on mw1227 is CRITICAL: CRITICAL - load average: 34.58, 33.33, 32.05 [03:16:02] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 224, down: 1, dormant: 0, excluded: 0, unused: 0 [03:19:32] PROBLEM - High CPU load on API appserver on mw1227 is CRITICAL: CRITICAL - load average: 34.97, 33.03, 32.15 [03:22:33] PROBLEM - High CPU load on API appserver on mw1227 is CRITICAL: CRITICAL - load average: 35.85, 32.85, 32.14 [03:38:52] 
PROBLEM - High CPU load on API appserver on mw1231 is CRITICAL: CRITICAL - load average: 52.55, 21.03, 15.07 [03:40:52] RECOVERY - High CPU load on API appserver on mw1231 is OK: OK - load average: 24.58, 22.25, 16.32 [05:12:52] PROBLEM - cassandra-b CQL 10.64.0.115:9042 on restbase1010 is CRITICAL: connect to address 10.64.0.115 and port 9042: Connection refused [05:13:13] PROBLEM - cassandra-b SSL 10.64.0.115:7001 on restbase1010 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [05:19:41] (03CR) 10ArielGlenn: [C: 032] keep intact output files from stubs/abstracts/logs around for retries [dumps] - 10https://gerrit.wikimedia.org/r/427684 (https://phabricator.wikimedia.org/T191177) (owner: 10ArielGlenn) [05:20:55] !log ariel@tin Started deploy [dumps/dumps@c2d3bb4]: keep completed stubs/abstracts/logs files around for retries [05:20:59] !log ariel@tin Finished deploy [dumps/dumps@c2d3bb4]: keep completed stubs/abstracts/logs files around for retries (duration: 00m 04s) [05:21:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:21:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:22:15] (03PS8) 10ArielGlenn: set actimeo=0 on snapshot1006 nfs mount for the next dump run [puppet] - 10https://gerrit.wikimedia.org/r/427603 (https://phabricator.wikimedia.org/T191177) [05:23:13] (03CR) 10ArielGlenn: [C: 032] set actimeo=0 on snapshot1006 nfs mount for the next dump run [puppet] - 10https://gerrit.wikimedia.org/r/427603 (https://phabricator.wikimedia.org/T191177) (owner: 10ArielGlenn) [05:26:19] (03PS1) 10Marostegui: db-eqiad.php: Depool db1110 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427844 (https://phabricator.wikimedia.org/T190148) [05:27:44] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1110 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427844 (https://phabricator.wikimedia.org/T190148) (owner: 10Marostegui) [05:28:57] (03Merged) 10jenkins-bot: 
db-eqiad.php: Depool db1110 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427844 (https://phabricator.wikimedia.org/T190148) (owner: 10Marostegui) [05:29:15] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1110 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427844 (https://phabricator.wikimedia.org/T190148) (owner: 10Marostegui) [05:30:34] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1110 for alter table (duration: 01m 17s) [05:30:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:31:02] !log Deploy schema change on db1110 - T191519 T188299 T190148 [05:31:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:31:09] T191519: Schema change for rc_namespace_title_timestamp index - https://phabricator.wikimedia.org/T191519 [05:31:09] T190148: Change DEFAULT 0 for rev_text_id on production DBs - https://phabricator.wikimedia.org/T190148 [05:31:09] T188299: Schema change for refactored actor storage - https://phabricator.wikimedia.org/T188299 [05:31:32] !log Start atop on db1114 with "-R" option enabled - T192551 [05:31:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:31:37] T192551: atop on stretch overloading a host - https://phabricator.wikimedia.org/T192551 [05:32:22] 10Operations, 10monitoring: atop on stretch overloading a host - https://phabricator.wikimedia.org/T192551#4145128 (10Marostegui) No errors running atop without "-R". I have just started it with "-R" to see if errors start showing up. 
[05:32:22] PROBLEM - High CPU load on API appserver on mw1226 is CRITICAL: CRITICAL - load average: 35.71, 33.17, 32.10 [05:42:28] (03PS1) 10Marostegui: db-eqiad.php: Depool db1114 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427846 [05:55:47] <_joe_> !log depooling mw1227 from live traffic for investigation [05:55:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:04:52] PROBLEM - Apache HTTP on mw1227 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:05:10] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1114 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427846 (owner: 10Marostegui) [06:05:13] PROBLEM - HHVM rendering on mw1227 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:05:22] PROBLEM - Nginx local proxy to apache on mw1227 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:06:23] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1114 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427846 (owner: 10Marostegui) [06:07:12] RECOVERY - High CPU load on API appserver on mw1227 is OK: OK - load average: 0.21, 7.94, 23.31 [06:07:49] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1114 (duration: 01m 16s) [06:07:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:07:59] !log Stop mysql db1114 for a reboot [06:08:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:09:24] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1114 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427846 (owner: 10Marostegui) [06:11:43] RECOVERY - Check systemd state on db1114 is OK: OK - running: The system is fully operational [06:16:51] 10Operations, 10monitoring: atop on stretch overloading a host - https://phabricator.wikimedia.org/T192551#4145151 (10Marostegui) As soon as it was started there was a spike of errors. So looks like -R is the offender here. 
{F17173512} I have left atop started without -R and will leave it like that for the we... [06:17:07] 10Operations, 10monitoring: atop on stretch overloading a host - https://phabricator.wikimedia.org/T192551#4145152 (10Marostegui) p:05Triage>03Normal [06:17:27] (03PS1) 10Marostegui: db-eqiad.php: Slowly repool db1114 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427848 [06:22:17] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Slowly repool db1114 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427848 (owner: 10Marostegui) [06:23:30] (03Merged) 10jenkins-bot: db-eqiad.php: Slowly repool db1114 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427848 (owner: 10Marostegui) [06:23:45] (03CR) 10jenkins-bot: db-eqiad.php: Slowly repool db1114 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427848 (owner: 10Marostegui) [06:25:19] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Slowly repool db1114 (duration: 01m 15s) [06:25:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:26:57] !log kafka::analytics remove strongswan leftovers T185136 [06:27:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:27:04] T185136: Move webrequest varnishkafka and consumers to Kafka jumbo cluster. 
- https://phabricator.wikimedia.org/T185136 [06:32:12] RECOVERY - Disk space on labtestnet2001 is OK: DISK OK [06:33:05] (03PS1) 10Marostegui: db-eqiad.php: Fully repool db1114 in main traffic [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427850 [06:49:11] (03PS2) 10Elukey: Set Debian Stretch as target OS for all the Hadoop worker nodes [puppet] - 10https://gerrit.wikimedia.org/r/427702 (https://phabricator.wikimedia.org/T192557) [06:49:55] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Fully repool db1114 in main traffic [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427850 (owner: 10Marostegui) [06:50:19] (03CR) 10Elukey: [C: 032] Set Debian Stretch as target OS for all the Hadoop worker nodes [puppet] - 10https://gerrit.wikimedia.org/r/427702 (https://phabricator.wikimedia.org/T192557) (owner: 10Elukey) [06:51:14] (03Merged) 10jenkins-bot: db-eqiad.php: Fully repool db1114 in main traffic [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427850 (owner: 10Marostegui) [06:51:29] (03CR) 10jenkins-bot: db-eqiad.php: Fully repool db1114 in main traffic [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427850 (owner: 10Marostegui) [06:54:11] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Restore main traffic original weight for db1114 (duration: 01m 15s) [06:54:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:57:31] 10Operations, 10Performance-Team, 10Patch-For-Review: /usr/local/bin/xenon-generate-svgs and flamegraph.pl cronspam - https://phabricator.wikimedia.org/T169249#4145172 (10Gilles) There's no leading space: ``` gilles@mwlog1001:~$ sed -n '58775,58777p' /srv/xenon/logs/daily/2018-04-17.all.log api.php;{GET};A... 
[06:59:45] (03CR) 10Gilles: Filter out invalid records in xenon-log (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/427816 (https://phabricator.wikimedia.org/T169249) (owner: 10Gilles) [07:01:00] 10Puppet, 10Cloud-Services: Make changing puppetmasters for Labs instances more easy - https://phabricator.wikimedia.org/T152941#4145173 (10hashar) 05declined>03Open That is still broken. On project having a puppetmaster, any new instance ends up with a broken Puppet. The reason is firstboot.sh running pup... [07:01:51] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1110" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427852 [07:01:54] (03PS2) 10Marostegui: Revert "db-eqiad.php: Depool db1110" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427852 [07:03:36] 10Operations, 10monitoring, 10Patch-For-Review: Netbox: add Icinga check for PostgreSQL - https://phabricator.wikimedia.org/T185504#3917698 (10ema) We've had the following Icinga `UNKNOWN` on netmon2001 for the past 6 days: ```Postgres Replication Lag - ERROR: FATAL: no pg_hba.conf entry for host "2620:0:86... [07:04:00] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1110" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427852 (owner: 10Marostegui) [07:05:14] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1110" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427852 (owner: 10Marostegui) [07:05:33] PROBLEM - High CPU load on API appserver on mw1226 is CRITICAL: CRITICAL - load average: 31.95, 31.23, 32.09 [07:06:52] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1110 after alter table (duration: 01m 16s) [07:06:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:07:30] ema: esams 50x alarm incoming https://logstash.wikimedia.org/app/kibana#/dashboard/Varnish-Webrequest-50X [07:08:15] elukey: thanks [07:08:32] wow.. 
preemptive alarm service :P [07:08:42] (03PS1) 10Marostegui: db-eqiad.php: Slowly repool db1114 in API [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427854 [07:08:57] !log upgrading API servers in codfw to MEMC_VAL_COMPRESSION_ZLIB enabled HHVM build [07:09:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:09:08] !log cp3032/cp3043: restart varnish-be due to mbox lag [07:09:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:09:50] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1110" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427852 (owner: 10Marostegui) [07:10:41] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Slowly repool db1114 in API [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427854 (owner: 10Marostegui) [07:11:04] elukey: FTR https://gerrit.wikimedia.org/r/#/c/426858/ should bring these things back to an acceptable level [07:11:33] PROBLEM - High CPU load on API appserver on mw1226 is CRITICAL: CRITICAL - load average: 34.07, 32.18, 32.14 [07:11:33] nice! 
[07:11:53] (03Merged) 10jenkins-bot: db-eqiad.php: Slowly repool db1114 in API [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427854 (owner: 10Marostegui) [07:13:36] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Slowly repool db1114 in API (duration: 01m 15s) [07:13:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:14:02] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [07:14:44] ^ mitigated w/ manual restarts https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=2&fullscreen&orgId=1&var-site=All&var-cache_type=All&var-status_type=5&from=now-30m&to=now [07:15:12] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [07:15:18] 10Operations, 10Performance-Team, 10Patch-For-Review: /usr/local/bin/xenon-generate-svgs and flamegraph.pl cronspam - https://phabricator.wikimedia.org/T169249#4145185 (10Gilles) I've confirmed that the PHP Redis client we use will incorrectly remove the leading space: ``` gilles@terbium:~$ mwscript eval.ph... 
[07:15:27] (03CR) 10jenkins-bot: db-eqiad.php: Slowly repool db1114 in API [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427854 (owner: 10Marostegui) [07:15:38] (03Abandoned) 10Gilles: Filter out invalid records in xenon-log [puppet] - 10https://gerrit.wikimedia.org/r/427816 (https://phabricator.wikimedia.org/T169249) (owner: 10Gilles) [07:15:53] (03PS1) 10Marostegui: db-eqiad.php: Give more traffic to db1114 in API [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427855 [07:23:12] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [07:24:02] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [07:30:50] <_joe_> !log upgrading hhvm on all jobrunners in eqiad [07:30:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:32:44] !log cp3030: restart varnish-be due to mbox lag [07:32:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:36:03] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [07:37:13] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [07:37:56] !log upgrade qemu on 
ganeti2006 to 1:2.8+dfsg-3~bpo8+1 and migrate mwdebug2001 to it T150532 [07:38:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:38:02] T150532: Upgrade qemu on ganeti clusters to 2.7 - https://phabricator.wikimedia.org/T150532 [07:38:03] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Give more traffic to db1114 in API [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427855 (owner: 10Marostegui) [07:38:22] !log cp3041: restart varnish-be due to mbox lag [07:38:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:39:21] (03Merged) 10jenkins-bot: db-eqiad.php: Give more traffic to db1114 in API [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427855 (owner: 10Marostegui) [07:39:28] (03PS1) 10Jcrespo: mariadb: Depool db2071 for upgrade [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427861 [07:39:34] (03CR) 10jenkins-bot: db-eqiad.php: Give more traffic to db1114 in API [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427855 (owner: 10Marostegui) [07:40:58] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Give more API traffic to db1114 (duration: 01m 17s) [07:41:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:42:36] (03CR) 10Jcrespo: [C: 032] mariadb: Depool db2071 for upgrade [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427861 (owner: 10Jcrespo) [07:43:48] (03Merged) 10jenkins-bot: mariadb: Depool db2071 for upgrade [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427861 (owner: 10Jcrespo) [07:44:50] (03CR) 10jenkins-bot: mariadb: Depool db2071 for upgrade [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427861 (owner: 10Jcrespo) [07:48:36] !log upgrading app servers in codfw to MEMC_VAL_COMPRESSION_ZLIB enabled HHVM build [07:48:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:48:50] 10Operations, 10Discovery, 10Wikidata, 10Wikidata-Query-Service, and 2 others: reimage wdqs1003 / 
wdqs200[123] with RAID - https://phabricator.wikimedia.org/T189192#4145201 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['wdqs1005.eqiad.wmnet']... [07:52:22] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [07:52:38] !log jynus@tin Synchronized wmf-config/db-codfw.php: Depool db2071 (duration: 01m 16s) [07:52:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:53:12] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [07:54:02] 10Operations, 10Performance-Team, 10Patch-For-Review: Move coal from graphite machine(s) - https://phabricator.wikimedia.org/T159354#4145203 (10fgiunchedi) Thanks @Imarlier for the explanation and insight! Makes sense to me, the other thing I suggest checking is coal's whisper files aggregation/retetion peri... [07:57:01] !log starting reimage of db2071 [07:57:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:57:42] PROBLEM - DPKG on mw1303 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [07:58:37] 10Operations, 10Deployments, 10Patch-For-Review, 10Release, 10Release-Engineering-Team (Kanban): Deploy Scap 3.8.0 to production - https://phabricator.wikimedia.org/T192124#4145207 (10fgiunchedi) >>! In T192124#4144670, @demon wrote: > That was part of that commit. I was kinda following the example set b... 
[07:58:41] RECOVERY - DPKG on mw1303 is OK: All packages OK [07:59:04] (03PS1) 10Marostegui: db-eqiad.php: Fully repool db1114 in API [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427868 [08:02:12] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Fully repool db1114 in API [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427868 (owner: 10Marostegui) [08:03:18] (03CR) 10Filippo Giunchedi: [C: 031] Target kafka jmx exporters by profiles instead of roles [puppet] - 10https://gerrit.wikimedia.org/r/427672 (owner: 10Ottomata) [08:03:32] (03Merged) 10jenkins-bot: db-eqiad.php: Fully repool db1114 in API [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427868 (owner: 10Marostegui) [08:05:30] Hi. Any op around for a quick script run (initSiteStats)? [08:05:31] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Fully repool db1114 in API - T191996 (duration: 01m 16s) [08:05:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:05:37] T191996: db1114 connection issues - https://phabricator.wikimedia.org/T191996 [08:08:47] PROBLEM - DPKG on mw1310 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [08:08:48] PROBLEM - High CPU load on API appserver on mw1232 is CRITICAL: CRITICAL - load average: 36.99, 34.28, 32.07 [08:09:11] (03CR) 10jenkins-bot: db-eqiad.php: Fully repool db1114 in API [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427868 (owner: 10Marostegui) [08:09:47] RECOVERY - DPKG on mw1310 is OK: All packages OK [08:11:57] PROBLEM - HHVM jobrunner on mw1311 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time [08:12:57] RECOVERY - HHVM jobrunner on mw1311 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.006 second response time [08:13:16] 10Operations, 10Discovery, 10Wikidata, 10Wikidata-Query-Service, and 2 others: reimage wdqs1003 / wdqs200[123] with RAID - https://phabricator.wikimedia.org/T189192#4145228 (10ops-monitoring-bot) Completed auto-reimage of 
hosts: ``` ['wdqs1005.eqiad.wmnet'] ``` and were **ALL** successful. [08:20:00] PROBLEM - HHVM jobrunner on mw1306 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time [08:20:37] 10Operations, 10Performance-Team, 10Patch-For-Review: /usr/local/bin/xenon-generate-svgs and flamegraph.pl cronspam - https://phabricator.wikimedia.org/T169249#4145252 (10Gilles) Found the explanation... the "1" I was getting from the python client wasn't the message I sent (in my test above I had sent it be... [08:21:01] RECOVERY - HHVM jobrunner on mw1306 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.001 second response time [08:22:54] (03PS2) 10Elukey: Changes needed for upgrading to Druid 0.10 [puppet] - 10https://gerrit.wikimedia.org/r/355471 (https://phabricator.wikimedia.org/T164008) (owner: 10Ottomata) [08:23:03] (03PS1) 10Gilles: Ignore Redis subscription message in xenon-log [puppet] - 10https://gerrit.wikimedia.org/r/427870 (https://phabricator.wikimedia.org/T169249) [08:23:20] PROBLEM - HHVM jobrunner on mw1304 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.003 second response time [08:24:20] RECOVERY - HHVM jobrunner on mw1304 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.007 second response time [08:25:57] (03PS2) 10Gilles: Ignore Redis subscription message in xenon-log [puppet] - 10https://gerrit.wikimedia.org/r/427870 (https://phabricator.wikimedia.org/T169249) [08:27:10] PROBLEM - DPKG on mw1309 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [08:28:16] RECOVERY - DPKG on mw1309 is OK: All packages OK [08:29:33] 10Operations, 10Performance-Team, 10Patch-For-Review, 10Security: /usr/local/bin/xenon-generate-svgs and flamegraph.pl cronspam - https://phabricator.wikimedia.org/T169249#4145288 (10Gilles) [08:29:56] 10Operations, 10Performance-Team, 10Patch-For-Review, 10Security: /usr/local/bin/xenon-generate-svgs and flamegraph.pl cronspam - 
https://phabricator.wikimedia.org/T169249#4145291 (10Gilles) [08:37:58] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler03/10990/" [puppet] - 10https://gerrit.wikimedia.org/r/355471 (https://phabricator.wikimedia.org/T164008) (owner: 10Ottomata) [08:39:35] !log restart hhvm on mw[1226,1232].eqiad.wmnet - high load [08:39:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:39:49] !log Going to sanitize gorwiki euwikisource romdwikimedia inhwiki on db1095 - T189112 T189466 T187774 T184375 [08:39:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:39:59] T184375: Prepare and check storage layer for inhwiki - https://phabricator.wikimedia.org/T184375 [08:39:59] T189466: Prepare storage layer for euwikisource - https://phabricator.wikimedia.org/T189466 [08:39:59] T187774: Prepare and check storage layer for romdwikimedia - https://phabricator.wikimedia.org/T187774 [08:39:59] T189112: Prepare and check storage layer for gorwiki - https://phabricator.wikimedia.org/T189112 [08:41:46] !log upgrading job runners in codfw to MEMC_VAL_COMPRESSION_ZLIB enabled HHVM build [08:41:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:47:36] RECOVERY - High CPU load on API appserver on mw1232 is OK: OK - load average: 7.73, 15.19, 23.90 [08:49:16] RECOVERY - High CPU load on API appserver on mw1226 is OK: OK - load average: 8.24, 13.06, 23.69 [09:02:40] (03CR) 10Filippo Giunchedi: [C: 032] Ignore Redis subscription message in xenon-log [puppet] - 10https://gerrit.wikimedia.org/r/427870 (https://phabricator.wikimedia.org/T169249) (owner: 10Gilles) [09:02:45] (03PS3) 10Filippo Giunchedi: Ignore Redis subscription message in xenon-log [puppet] - 10https://gerrit.wikimedia.org/r/427870 (https://phabricator.wikimedia.org/T169249) (owner: 10Gilles) [09:06:15] !log upgrading video scalers in codfw to MEMC_VAL_COMPRESSION_ZLIB enabled HHVM build [09:06:20] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:09:38] (03CR) 10Filippo Giunchedi: "I'll check back on cronspam from xenon-log" [puppet] - 10https://gerrit.wikimedia.org/r/427870 (https://phabricator.wikimedia.org/T169249) (owner: 10Gilles) [09:12:15] volunteers for https://gerrit.wikimedia.org/r/c/427619/ ? [09:12:19] !log restart of mw apis showing ~50% cpu utilization as precaution before the weekend - mw[1224,1225,1228,1230,1231,1233-1235,1276-1283,1286,1312,1313,1315,1316,1341,1343,1344,1347,1348]* [09:12:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:12:53] moritzm: --^ (-b 2) [09:13:24] Hauskatze: For what wiki? [09:13:37] It ran on all wikis last Sunday (we now have a cron for that). [09:13:59] for the newly created ones, but if there's a cron I'll let folks know on the task [09:14:12] they complain that the article counts are still at 0 while everything has been imported [09:14:23] not sure if we need an initial manual run though [09:14:59] No, it shouldn't need one. It should update automatically. That said, this didn't work that great in the past (hence the cron). [09:15:53] Hauskatze: The next cron run will be on 30th iirc. Up to you whether to wait that long. If not, I'd just sign it up for SWAT, should be easy enough. 
[09:17:07] not my wikis, but maybe I'll add a note on the swat page for a dev to do that in their spare time [09:18:23] (03PS1) 10Jcrespo: Revert "mariadb: Depool db2071 for upgrade" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427877 [09:19:46] PROBLEM - Disk space on labtestvirt2001 is CRITICAL: DISK CRITICAL - /home/aborrero/mnt is not accessible: Permission denied [09:20:38] arturo: ^ [09:22:12] (03PS2) 10Jcrespo: Revert "mariadb: Depool db2071 for upgrade" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427877 [09:22:14] (03PS1) 10Jcrespo: mariadb: Depool db2070 for upgrade [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427878 [09:22:29] (03CR) 10jerkins-bot: [V: 04-1] mariadb: Depool db2070 for upgrade [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427878 (owner: 10Jcrespo) [09:22:32] (03CR) 10Jcrespo: [C: 032] Revert "mariadb: Depool db2071 for upgrade" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427877 (owner: 10Jcrespo) [09:22:37] I don't fully understand the purpose of that check [09:22:46] RECOVERY - Disk space on labtestvirt2001 is OK: DISK OK [09:23:54] (03Merged) 10jenkins-bot: Revert "mariadb: Depool db2071 for upgrade" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427877 (owner: 10Jcrespo) [09:25:06] (03PS2) 10Jcrespo: mariadb: Depool db2070 for upgrade [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427878 [09:26:39] <_joe_> arturo: a check on free disk space? 
[09:27:12] you don't need concrete directory permissions for that [09:27:16] (03CR) 10Jcrespo: [C: 032] mariadb: Depool db2070 for upgrade [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427878 (owner: 10Jcrespo) [09:27:36] <_joe_> oh so you don't understand why it's failing [09:28:04] I don't understand why icinga cares about a directory created by root in my home directory :-P [09:28:28] (03Merged) 10jenkins-bot: mariadb: Depool db2070 for upgrade [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427878 (owner: 10Jcrespo) [09:29:16] (03CR) 10jenkins-bot: Revert "mariadb: Depool db2071 for upgrade" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427877 (owner: 10Jcrespo) [09:32:18] (03PS1) 10Jcrespo: Revert "mariadb: Depool db2070 for upgrade" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427879 [09:32:54] (03PS1) 10Jcrespo: Revert "install_server: Allow stretch reimage of db207* except db2079" [puppet] - 10https://gerrit.wikimedia.org/r/427880 [09:33:23] !log jynus@tin Synchronized wmf-config/db-codfw.php: Repool db2071, depool db2070 (duration: 01m 16s) [09:33:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:34:35] You need r-x for a directory to get the size of its contents, at least with `du`. [09:35:42] 10Operations, 10monitoring: prometheus on bast3002 misbehaving - https://phabricator.wikimedia.org/T192610#4145427 (10fgiunchedi) Thanks @bblack for taking a look! Looks like an "heavy" query from `varnish-failed-fetches` drove disk utilization to 100% starting at 04/19T22:17 and things snowballed from there,... 
[09:35:57] 10Operations, 10monitoring, 10User-fgiunchedi: prometheus on bast3002 misbehaving - https://phabricator.wikimedia.org/T192610#4145430 (10fgiunchedi) [09:36:31] (03PS1) 10Jcrespo: install_server: Revert patch allowing reimage of db207* hosts [puppet] - 10https://gerrit.wikimedia.org/r/427881 [09:36:38] (03Abandoned) 10Jcrespo: Revert "install_server: Allow stretch reimage of db207* except db2079" [puppet] - 10https://gerrit.wikimedia.org/r/427880 (owner: 10Jcrespo) [09:39:22] 10Operations, 10Deployments, 10Patch-For-Review, 10Release, 10Release-Engineering-Team (Kanban): Deploy Scap 3.8.0 to production - https://phabricator.wikimedia.org/T192124#4145433 (10mmodell) [09:39:56] 10Operations, 10Deployments, 10Patch-For-Review, 10Release, 10Release-Engineering-Team (Kanban): Deploy Scap 3.8.0 to production - https://phabricator.wikimedia.org/T192124#4128785 (10mmodell) >>! In T192124#4145207, @fgiunchedi wrote: > I don't know about conftool but AFAICS scap doesn't contain archite... [09:41:07] (03Abandoned) 10Fdans: Puppetize cron job archiving old MaxMind databases [puppet] - 10https://gerrit.wikimedia.org/r/425247 (https://phabricator.wikimedia.org/T136732) (owner: 10Fdans) [09:41:49] !log upgrading mwdebug servers to MEMC_VAL_COMPRESSION_ZLIB enabled HHVM build [09:41:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:41:59] !log starting reimage of db2070 [09:42:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:44:43] (03CR) 10Volans: "> That requires the cumin master to use Stretch. On Jessie there are" [puppet] - 10https://gerrit.wikimedia.org/r/419131 (https://phabricator.wikimedia.org/T188112) (owner: 10Volans) [09:44:57] 10Operations, 10monitoring, 10Patch-For-Review: Netbox: add Icinga check for PostgreSQL - https://phabricator.wikimedia.org/T185504#4145439 (10Gehel) >>! 
In T185504#4145176, @ema wrote: > It's also unclear to me whether `UNKNOWN` is the proper severity for this issue, it should perhaps be `CRITICAL` instead.... [09:45:36] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 226, down: 0, dormant: 0, excluded: 0, unused: 0 [09:51:10] !log upgrading deployment servers to MEMC_VAL_COMPRESSION_ZLIB enabled HHVM build [09:51:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:53:09] (03PS2) 10Filippo Giunchedi: Bump scap version to 3.8.0-2 [puppet] - 10https://gerrit.wikimedia.org/r/427535 (https://phabricator.wikimedia.org/T192124) (owner: 1020after4) [09:58:28] !log upload scap 3.8.0-2 - T192124 [09:58:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:58:34] T192124: Deploy Scap 3.8.0 to production - https://phabricator.wikimedia.org/T192124 [09:59:03] (03CR) 10Jcrespo: [C: 032] install_server: Revert patch allowing reimage of db207* hosts [puppet] - 10https://gerrit.wikimedia.org/r/427881 (owner: 10Jcrespo) [09:59:41] (03PS3) 10Filippo Giunchedi: Bump scap version to 3.8.0-2 [puppet] - 10https://gerrit.wikimedia.org/r/427535 (https://phabricator.wikimedia.org/T192124) (owner: 1020after4) [09:59:53] twentyafterfour: scap upgraded on tin already, I'll merge the puppet patch [10:00:09] godog: thanks! 
[10:00:55] (03CR) 10Filippo Giunchedi: [C: 032] Bump scap version to 3.8.0-2 [puppet] - 10https://gerrit.wikimedia.org/r/427535 (https://phabricator.wikimedia.org/T192124) (owner: 1020after4) [10:01:42] jynus: merging your change too [10:02:02] thanks [10:03:52] 10Operations, 10Performance-Team, 10Patch-For-Review: /usr/local/bin/xenon-generate-svgs and flamegraph.pl cronspam - https://phabricator.wikimedia.org/T169249#4145475 (10Aklapper) [10:17:50] (03PS1) 10Muehlenhoff: Enable base::service_auto_restart for prometheus-nutcracker-exporter [puppet] - 10https://gerrit.wikimedia.org/r/427884 (https://phabricator.wikimedia.org/T135991) [10:20:00] (03PS3) 10Muehlenhoff: Enable base::service_auto_restart for prometheus-snmp-exporter [puppet] - 10https://gerrit.wikimedia.org/r/424243 (https://phabricator.wikimedia.org/T135991) [10:20:55] 10Operations, 10Performance-Team, 10Patch-For-Review: /usr/local/bin/xenon-generate-svgs and flamegraph.pl cronspam - https://phabricator.wikimedia.org/T169249#4145531 (10fgiunchedi) So far so good, the restart didn't cause any cronspam this time around [10:31:04] 10Operations, 10Discovery, 10Wikidata, 10Wikidata-Query-Service, and 2 others: reimage wdqs1003 / wdqs200[123] with RAID - https://phabricator.wikimedia.org/T189192#4145535 (10Gehel) All wdqs servers are now running RAID on Debian Stretch. Data is fully reloaded. 
[10:46:58] (03PS1) 10Muehlenhoff: Enable base::service_auto_restart for cassandra-metrics-collector [puppet] - 10https://gerrit.wikimedia.org/r/427889 (https://phabricator.wikimedia.org/T135991) [10:56:11] (03CR) 10Filippo Giunchedi: [C: 031] Enable base::service_auto_restart for cassandra-metrics-collector [puppet] - 10https://gerrit.wikimedia.org/r/427889 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [10:56:26] (03CR) 10Filippo Giunchedi: [C: 031] Enable base::service_auto_restart for prometheus-nutcracker-exporter [puppet] - 10https://gerrit.wikimedia.org/r/427884 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [10:56:37] (03CR) 10Filippo Giunchedi: [C: 031] Enable base::service_auto_restart for prometheus-snmp-exporter [puppet] - 10https://gerrit.wikimedia.org/r/424243 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [11:06:55] !log installing tiff security updates on trusty [11:07:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:27:47] !log reimage analytics1068 to Debian Stretch - T192557 [11:27:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:27:53] T192557: Reimage the Debian Jessie Analytics worker nodes to Stretch. 
- https://phabricator.wikimedia.org/T192557 [11:53:11] !log installing apache security updates on netmon1002/2001 [11:53:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:54:57] (03CR) 10Jcrespo: [C: 032] Revert "mariadb: Depool db2070 for upgrade" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427879 (owner: 10Jcrespo) [11:56:38] (03Merged) 10jenkins-bot: Revert "mariadb: Depool db2070 for upgrade" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427879 (owner: 10Jcrespo) [12:05:31] !log upgrading apache on einsteinium/icinga.wikimedia.org [12:05:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:06:11] ^ completed, icinga back to normal [12:06:35] !log installing apache security updates on video scalers [12:06:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:15:12] I used git as root by error on tin, I think I cleaned up permissions after that, but ping me if you see something weird [12:15:49] !log jynus@tin Synchronized wmf-config/db-codfw.php: Repool db2070 (duration: 01m 17s) [12:15:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:18:08] !log upgrading and restarting dbstore2002 [12:18:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:25:48] !log upgrading apache on auth* servers [12:25:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:20] (03PS1) 10Jcrespo: mariadb: Create /run/mysqld on server start with tmpfiles.d [puppet] - 10https://gerrit.wikimedia.org/r/427902 [13:00:35] !log installing zsh security updates on trusty servers [13:00:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:37] (03PS1) 10Jcrespo: mariadb: Do not create /srv/sqldata and /srv/tmp if datadir is false [puppet] - 10https://gerrit.wikimedia.org/r/427904 [13:16:18] (03CR) 10jerkins-bot: [V: 04-1] mariadb: Do not create /srv/sqldata and 
/srv/tmp if datadir is false [puppet] - 10https://gerrit.wikimedia.org/r/427904 (owner: 10Jcrespo) [13:25:00] !log upgrading mysql (as shipped in Debian) on bohrium [13:25:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:10] Dereckson: I've marked https://phabricator.wikimedia.org/T171013 as UBN to get the Wikidata team's attention to get the mess fixed up [13:27:03] Reedy: Will have a look [13:27:15] hoo: I'm sure you're aware of it when you look :P [13:27:15] I thought I already documented that on Wikitech like ages ago [13:27:32] Every time we create a wiki... We get to this stage of stuff still not working right [13:27:42] So there's either stuff missing... Or the docs are completely wrong :) [13:27:56] I have a meeting now, but will have a look after (so surely today) [13:28:24] Thanks :) [13:32:14] (03PS1) 10Jcrespo: mariadb: Depool db2086 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427905 [13:32:39] jynus, marostegui: Any objection to me running the CREATE TABLE to recreate the `slots` table now? (following up T190153, and T184446#4143097) [13:32:39] T184446: Configure Toolforge replica views and dumps for the new MCR tables - https://phabricator.wikimedia.org/T184446 [13:32:39] T190153: DROP unused 'slots' table (WAS: In the slots table, replace slot_inherited with slot_origin) - https://phabricator.wikimedia.org/T190153 [13:34:09] anomie: normally we do not like Friday deploys, but we can make an exception [13:34:39] I can wait if you'd rather. 
[13:34:47] it is ok, I am around [13:35:01] better now than in 2 hours, when I will not be around :-) [13:35:29] !log (re-)creating `slots` table on all wikis, following up T190153 and T184446#4143097 [13:35:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:37:49] normally those are totally safe, but there is a small chance of one of those not being deleted and breaking replication [13:38:03] *having been deleted [13:44:04] (03PS2) 10Jcrespo: mariadb: Create /run/mysqld on server start with tmpfiles.d [puppet] - 10https://gerrit.wikimedia.org/r/427902 [13:44:06] (03PS2) 10Jcrespo: mariadb: Do not create /srv/sqldata and /srv/tmp if datadir is false [puppet] - 10https://gerrit.wikimedia.org/r/427904 [13:45:19] jynus: My run of creations is done now, FYI. [13:45:26] thanks [13:55:33] (03CR) 10Jcrespo: [C: 032] mariadb: Depool db2086 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427905 (owner: 10Jcrespo) [13:56:48] (03Merged) 10jenkins-bot: mariadb: Depool db2086 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427905 (owner: 10Jcrespo) [13:59:08] !log jynus@tin Synchronized wmf-config/db-codfw.php: Depool db2086 (duration: 01m 13s) [13:59:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:29] !log upgrade and restart db2086 [14:00:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:08] !log upgrading labweb* servers to MEMC_VAL_COMPRESSION_ZLIB enabled HHVM build [14:02:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:42] (03PS2) 10Andrew Bogott: Wikitech: change maintenance jobs to use the 'wikitech' dblist [puppet] - 10https://gerrit.wikimedia.org/r/427812 (https://phabricator.wikimedia.org/T189542) [14:03:50] (03CR) 10Andrew Bogott: [C: 032] Wikitech: change maintenance jobs to use the 'wikitech' dblist [puppet] - 10https://gerrit.wikimedia.org/r/427812 
(https://phabricator.wikimedia.org/T189542) (owner: 10Andrew Bogott) [14:08:46] (03PS1) 10Andrew Bogott: Remove silver.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427910 (https://phabricator.wikimedia.org/T189542) [14:13:22] (03CR) 10Andrew Bogott: [C: 032] Remove silver.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427910 (https://phabricator.wikimedia.org/T189542) (owner: 10Andrew Bogott) [14:14:44] (03Merged) 10jenkins-bot: Remove silver.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427910 (https://phabricator.wikimedia.org/T189542) (owner: 10Andrew Bogott) [14:16:52] !log andrew@tin Synchronized dblists: Purging obsolete silver.dblist (duration: 01m 17s) [14:17:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:16] (03PS1) 10Jcrespo: mariadb: Repool db2086, depool db2087 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427912 [14:37:17] (03CR) 10Jcrespo: [C: 032] mariadb: Repool db2086, depool db2087 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427912 (owner: 10Jcrespo) [14:39:04] (03PS1) 10Jcrespo: mariadb: Repool db2087 after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427913 [14:41:50] !log jynus@tin Synchronized wmf-config/db-codfw.php: Repool db2086, depool db2087 (duration: 01m 16s) [14:41:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:44] ACKNOWLEDGEMENT - cassandra-b CQL 10.64.0.115:9042 on restbase1010 is CRITICAL: connect to address 10.64.0.115 and port 9042: Connection refused eevans Decommissioned [14:42:44] ACKNOWLEDGEMENT - cassandra-b SSL 10.64.0.115:7001 on restbase1010 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused eevans Decommissioned [14:47:00] (03PS1) 10Andrew Bogott: Add 'wikitech' section for wikitech db hosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427915 (https://phabricator.wikimedia.org/T189542) [14:55:53] (03CR) 10Jcrespo: "Let's 
deploy better next week- so we do not accidentally break the other wikis." (035 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427915 (https://phabricator.wikimedia.org/T189542) (owner: 10Andrew Bogott) [14:55:56] 10Operations, 10monitoring, 10Graphite, 10User-fgiunchedi: Programmatic generation of grafana dashboards - https://phabricator.wikimedia.org/T171482#4146092 (10fgiunchedi) [14:57:12] (03PS1) 10Muehlenhoff: Enable base::service_auto_restart for clamav-freshclam [puppet] - 10https://gerrit.wikimedia.org/r/427916 (https://phabricator.wikimedia.org/T135991) [15:03:28] (03CR) 10Jcrespo: [C: 032] mariadb: Repool db2087 after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427913 (owner: 10Jcrespo) [15:04:52] (03Merged) 10jenkins-bot: mariadb: Repool db2087 after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427913 (owner: 10Jcrespo) [15:06:27] (03CR) 10jenkins-bot: mariadb: Repool db2086, depool db2087 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427912 (owner: 10Jcrespo) [15:07:03] !log jynus@tin Synchronized wmf-config/db-codfw.php: Repool db2087 (duration: 01m 16s) [15:07:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:26] (03CR) 10Andrew Bogott: "So... you want both eqiad and codfw sites to point to the codfw db server, and mark eqiad and read-only?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427915 (https://phabricator.wikimedia.org/T189542) (owner: 10Andrew Bogott) [15:12:48] (03CR) 10Jcrespo: "No, eqiad should point to eqiad, and codfw to codfw, and codfw should be set as read only." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/427915 (https://phabricator.wikimedia.org/T189542) (owner: 10Andrew Bogott) [15:13:08] (03CR) 10jenkins-bot: Revert "mariadb: Depool db2070 for upgrade" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427879 (owner: 10Jcrespo) [15:13:12] (03CR) 10jenkins-bot: mariadb: Depool db2070 for upgrade [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427878 (owner: 10Jcrespo) [15:13:18] (03CR) 10jenkins-bot: mariadb: Depool db2086 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427905 (owner: 10Jcrespo) [15:13:22] (03CR) 10jenkins-bot: Remove silver.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427910 (https://phabricator.wikimedia.org/T189542) (owner: 10Andrew Bogott) [15:13:27] (03CR) 10jenkins-bot: mariadb: Repool db2087 after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427913 (owner: 10Jcrespo) [15:19:36] (03CR) 10Eevans: [C: 031] Enable base::service_auto_restart for cassandra-metrics-collector [puppet] - 10https://gerrit.wikimedia.org/r/427889 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [15:37:01] (03PS1) 10Jcrespo: mariadb: Update mysql 8.0 package [software] - 10https://gerrit.wikimedia.org/r/427926 [16:16:14] (03PS1) 10Herron: WIP: puppetmaster: remove support for puppetdb 2.x [puppet] - 10https://gerrit.wikimedia.org/r/427928 (https://phabricator.wikimedia.org/T190318) [16:16:51] (03CR) 10jerkins-bot: [V: 04-1] WIP: puppetmaster: remove support for puppetdb 2.x [puppet] - 10https://gerrit.wikimedia.org/r/427928 (https://phabricator.wikimedia.org/T190318) (owner: 10Herron) [16:19:51] (03PS2) 10Herron: WIP: puppetmaster: remove support for puppetdb 2.x [puppet] - 10https://gerrit.wikimedia.org/r/427928 (https://phabricator.wikimedia.org/T190318) [16:25:07] I'll be deploying a fix on group0 wikis for CirrusSearch [16:26:24] ^ blessed by releng, FYI [16:26:51] reminder: group0 is testwikis + mw.org, so our "safe" group 
[16:39:38] subbu: feel like testing the upload of a test file to the parsoid release archive? [16:40:02] mutante, sure. [16:40:10] let me look at the ticket [16:40:37] subbu: let me TLDR it for you [16:40:45] k :) [16:40:46] we now have https://releases.wikimedia.org/parsoid/ [16:41:00] and there is a new group called releasers-parsoid [16:41:10] and you are the only member right now [16:41:29] scp sometestfile releases1001.eqiad.wmnet:/srv/org/wikimedia/releases/parsoid/ [16:41:31] could you add arlo and scott to that group as well? [16:42:37] sure, i would just handle it as an access request which takes 3 business days... doesn't need ops meeting though [16:43:14] mutante, success https://releases.wikimedia.org/parsoid/ [16:44:42] !log dcausse@tin Synchronized php-1.31.0-wmf.30/extensions/CirrusSearch/: T192609: Do not propagate Elastica doc modifications out of DataSender (duration: 01m 34s) [16:44:43] i don't see a file, is the index cached.. looking on releases1001.eqiad.wmnet [16:44:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:44:52] T192609: Search backend error during sending {numBulk} documents to the {index} index(s) after {tookMs}: {error_message} - https://phabricator.wikimedia.org/T192609 [16:45:30] subbu: i see them in the file system but my browser keeps showing me an empty index.. caching [16:45:48] my browser shows them just fine. [16:45:52] ctrl+r [16:46:07] mutante, so, anything else you want me to test / do? [16:46:10] yea, i also tried adding ?foo and stuff [16:46:13] which usually works [16:46:25] subbu: no, it's success then. you have a place to archive the files, right [16:46:31] yup. thanks. :) [16:46:40] resolves the tickets except i will add a subtask to add more people and they will have it next week, k [16:46:47] welcome [16:46:52] updated https://www.mediawiki.org/w/index.php?title=Parsoid%2FReleases&type=revision&diff=2761835&oldid=2744784 as well. 
[16:46:58] nice [16:47:30] subbu: there is just one thing to remember. it could happen in the future that the releases server is switched to codfw [16:47:45] ok [16:47:49] there are 2 backends, 1001 in eqiad and 2001 in codfw [16:47:57] and i setup automatic rsync between them [16:48:04] so if you upload to 1001 it will sync over [16:48:14] and we have active-active setup and serve from both [16:48:23] ok. sounds good. [16:48:30] which releases server is the currently "active" one is configured in Hiera in common.yaml [16:49:18] if that is changed then the rsync would also automatically change direction and 2001 would be the source [16:49:32] nice. [16:49:46] the shell access is based on role, so applies to both. [16:49:48] ok, that's it : [16:49:49] :) [16:51:22] alright! time for lunch then. :) [16:52:31] (03PS1) 10Andrew Bogott: Rename 'm5' section to 'wikitech' and add explicit hostnames. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427930 (https://phabricator.wikimedia.org/T189542) [16:53:13] (03CR) 10Andrew Bogott: "I believe the latest patch implements most of your suggestions. I don't know how to mark a db server as read-only. Also, labtestwikitech" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427930 (https://phabricator.wikimedia.org/T189542) (owner: 10Andrew Bogott) [16:53:49] dcausse: thanks for the quick patch and the deploy! watching https://logstash.wikimedia.org/goto/042b2b3677fa27897418b10ffa49a989 I haven't seen any new instances of the error. Everything look good on your side? [16:54:19] thcipriani: yw, me neither, I'll wait a bit more but everything looks good to me [16:55:01] awesome :) [17:03:33] 10Operations, 10Parsoid, 10Patch-For-Review, 10Release-Engineering-Team (Watching / External): Provide an archive endpoint for older Parsoid debs (on releases.wikimedia.org or elsewhere) - https://phabricator.wikimedia.org/T150672#4146464 (10Dzahn) 05Open>03Resolved 12:43 < subbu> mutante, success http... 
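A note on the releases setup described between 16:47:30 and 16:49:18: uploads go to whichever backend is marked active, and the automatic rsync always runs from the active backend to the other one, so changing the active host flips the sync direction. The Python sketch below only illustrates that selection logic. It is not the puppet code; `rsync_plan` and `RELEASES_HOSTS` are hypothetical names, and the full hostname `releases2001.codfw.wmnet` is an assumption (the log only says "2001 in codfw").

```python
# Hypothetical illustration of the "active host is the rsync source" rule.
# The second hostname is assumed; the log only names releases1001.eqiad.wmnet.
RELEASES_HOSTS = ("releases1001.eqiad.wmnet", "releases2001.codfw.wmnet")


def rsync_plan(active_host, hosts=RELEASES_HOSTS):
    """Return (source, destination) pairs: the active host syncs to all others."""
    if active_host not in hosts:
        raise ValueError(f"unknown releases host: {active_host}")
    return [(active_host, other) for other in hosts if other != active_host]


if __name__ == "__main__":
    # With 1001 active, files uploaded there are copied to 2001.
    for source, destination in rsync_plan("releases1001.eqiad.wmnet"):
        print(f"{source} -> {destination}")
```

If the active value were switched to the codfw host, the same function would return the reversed pair, which matches the "2001 would be the source" behaviour described in the conversation.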
[17:04:49] !log rebooting labvirt1021 and 1022 [17:04:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:12:03] (03PS3) 10Herron: WIP: puppetmaster: remove support for puppetdb 2.x [puppet] - 10https://gerrit.wikimedia.org/r/427928 (https://phabricator.wikimedia.org/T190318) [17:12:57] (03Abandoned) 10Chad: scap clean: Use --delete-excluded [mediawiki-config] - 10https://gerrit.wikimedia.org/r/424645 (https://phabricator.wikimedia.org/T157030) (owner: 10Chad) [17:14:25] (03CR) 10Herron: "https://puppet-compiler.wmflabs.org/compiler02/10993/" [puppet] - 10https://gerrit.wikimedia.org/r/427928 (https://phabricator.wikimedia.org/T190318) (owner: 10Herron) [17:14:40] (03PS4) 10Herron: puppetmaster: remove support for puppetdb 2.x [puppet] - 10https://gerrit.wikimedia.org/r/427928 (https://phabricator.wikimedia.org/T190318) [17:16:24] ACKNOWLEDGEMENT - Apache HTTP on mw1227 is CRITICAL: CRITICAL - Socket timeout after 10 seconds daniel_zahn per SAL: depooled for investigation of high load [17:16:24] ACKNOWLEDGEMENT - HHVM rendering on mw1227 is CRITICAL: CRITICAL - Socket timeout after 10 seconds daniel_zahn per SAL: depooled for investigation of high load [17:16:24] ACKNOWLEDGEMENT - Nginx local proxy to apache on mw1227 is CRITICAL: CRITICAL - Socket timeout after 10 seconds daniel_zahn per SAL: depooled for investigation of high load [17:16:40] ACKNOWLEDGEMENT - Apache HTTP on mw1227 is CRITICAL: CRITICAL - Socket timeout after 10 seconds daniel_zahn per SAL: depooled for investigation of high load [17:16:40] ACKNOWLEDGEMENT - HHVM rendering on mw1227 is CRITICAL: CRITICAL - Socket timeout after 10 seconds daniel_zahn per SAL: depooled for investigation of high load [17:16:40] ACKNOWLEDGEMENT - Nginx local proxy to apache on mw1227 is CRITICAL: CRITICAL - Socket timeout after 10 seconds daniel_zahn per SAL: depooled for investigation of high load [17:17:55] !log phab2001 - upgrading apache, openssl, mysql-common [17:18:00] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:18:05] 10Operations, 10Ops-Access-Requests: Requesting access to analytics servers for mepps - https://phabricator.wikimedia.org/T192472#4146497 (10mepps) @Dzahn my wikitech username is MEpps and here's the public key: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAACAQDMquf0ywthSAksqXIMATkeQt8ui6B2JxWES4zEMQVYtPlVUNnFGQyAbYN/Fe... [17:19:26] twentyafterfour: i'm installing apache updates on phab2001 .. and then also openssl and mysql-common. no issues so far. would also like to hit phab1001 at some point [17:26:01] !log phabricator (phab1001) - upgrading Apache, openssl, mysql-common [17:26:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:28:50] !log phabricator - restarted apache [17:28:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:32:40] PROBLEM - puppet last run on labvirt1022 is CRITICAL: Return code of 255 is out of bounds [17:35:36] !log gerrit: update mysql-client and deps 5.5.59 -> 5.5.60 [17:35:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:36:00] PROBLEM - configured eth on labvirt1021 is CRITICAL: Return code of 255 is out of bounds [17:37:41] PROBLEM - dhclient process on labvirt1021 is CRITICAL: Return code of 255 is out of bounds [17:38:29] ^ https://phabricator.wikimedia.org/T183937 [17:38:32] PROBLEM - ensure kvm processes are running on labvirt1021 is CRITICAL: Return code of 255 is out of bounds [17:39:07] these are new but " labvirt1021 has puppet signed but wont run" [17:41:01] I'm setting them up now, they apparently need to make some noise as they come up [17:41:04] I downtimed both [17:41:32] ah :) cool [17:41:57] !log imarlier@tin Started deploy [performance/coal@99db58f]: coal - update to submit via graphite. Not yet active, requires puppet changes [17:42:01] !log imarlier@tin Finished deploy [performance/coal@99db58f]: coal - update to submit via graphite. 
Not yet active, requires puppet changes (duration: 00m 04s) [17:42:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:42:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:42:41] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1943 bytes in 0.103 second response time [17:45:42] (03CR) 10Thcipriani: [C: 031] mwdeploy: Ensure home directory exists on all machines [puppet] - 10https://gerrit.wikimedia.org/r/427188 (owner: 10Chad) [17:47:41] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1955 bytes in 0.101 second response time [17:47:51] RECOVERY - dhclient process on labvirt1021 is OK: PROCS OK: 0 processes with command name dhclient [17:48:01] RECOVERY - configured eth on labvirt1021 is OK: OK - interfaces up [17:52:40] RECOVERY - puppet last run on labvirt1022 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [17:54:04] (03PS1) 10Urbanecm: Add logos for gorwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427940 (https://phabricator.wikimedia.org/T192669) [17:56:38] (03PS2) 10Urbanecm: Add logos for gorwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427940 (https://phabricator.wikimedia.org/T192669) [17:58:00] (03PS1) 10Urbanecm: Change timezone for napwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427941 (https://phabricator.wikimedia.org/T192568) [18:01:56] 10Operations, 10cloud-services-team (Kanban): rack/setup/install labvirt102[12] - https://phabricator.wikimedia.org/T183937#4146628 (10Andrew) 05Open>03Resolved Both these systems are now puppetized and ready for testing. 
[18:04:40] RECOVERY - ensure kvm processes are running on labvirt1021 is OK: PROCS OK: 1 process with regex args /usr/bin/kvm [18:05:52] (03PS1) 10Catrope: Enable internationalized maps on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427943 [18:08:19] ACKNOWLEDGEMENT - configured eth on labvirt1021 is CRITICAL: eth3 reporting no carrier. andrew bogott T192682 Why is this even [18:08:19] ACKNOWLEDGEMENT - configured eth on labvirt1022 is CRITICAL: eth3 reporting no carrier. andrew bogott T192682 Why is this even [18:11:43] (03PS1) 10Dzahn: admins: create shell account for mepps, add to analytics-privatedata [puppet] - 10https://gerrit.wikimedia.org/r/427944 (https://phabricator.wikimedia.org/T192472) [18:12:43] (03PS2) 10Dzahn: admins: create shell account for mepps, add to analytics-privatedata [puppet] - 10https://gerrit.wikimedia.org/r/427944 (https://phabricator.wikimedia.org/T192472) [18:12:52] (03PS3) 10Dzahn: admins: create shell account for mepps, add to analytics-privatedata [puppet] - 10https://gerrit.wikimedia.org/r/427944 (https://phabricator.wikimedia.org/T192472) [18:16:50] 10Operations, 10Ops-Access-Requests, 10Patch-For-Review: Requesting access to analytics servers for mepps - https://phabricator.wikimedia.org/T192472#4146686 (10Dzahn) @mepps Thanks! looks good. I made the needed puppet code change and uploaded to Gerrit. The next step will be getting this reviewed/merged (i... 
[18:17:21] (03PS1) 10Imarlier: graphite: add a specific retention rule for coal metrics [puppet] - 10https://gerrit.wikimedia.org/r/427945 (https://phabricator.wikimedia.org/T191994) [18:20:34] 10Operations, 10LDAP-Access-Requests, 10WMF-NDA-Requests: Add Tim_WMDE to the ldap/wmde group - https://phabricator.wikimedia.org/T192256#4146697 (10Dzahn) a:03Dzahn [18:23:07] !log add LDAP user "tieu" to group "wmde" (T192256) [18:23:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:23:14] T192256: Add Tim_WMDE to the ldap/wmde group - https://phabricator.wikimedia.org/T192256 [18:23:56] 10Operations, 10LDAP-Access-Requests, 10WMF-NDA-Requests: Add Tim_WMDE to the ldap/wmde group - https://phabricator.wikimedia.org/T192256#4146708 (10Dzahn) [18:24:42] 10Operations, 10LDAP-Access-Requests, 10WMF-NDA-Requests: Add Tim_WMDE to the ldap/wmde group - https://phabricator.wikimedia.org/T192256#4132854 (10Dzahn) 05Open>03Resolved @Tim_WMDE You have been added to the group. tieu is a member of wmde. It should work now. [18:27:25] (03CR) 10Dzahn: [C: 032] "per "attempting to aggregate will corrupt data" and "PITA having to go through the puppet merge process" - just doing it" [puppet] - 10https://gerrit.wikimedia.org/r/427945 (https://phabricator.wikimedia.org/T191994) (owner: 10Imarlier) [18:29:19] (03CR) 10Dzahn: [C: 04-2] "still want to use "nihonium" but needs to wait until the right, newer, WMF asset number is assigned" [dns] - 10https://gerrit.wikimedia.org/r/426295 (https://phabricator.wikimedia.org/T192092) (owner: 10Dzahn) [18:39:08] mutante: thanks for the quick merge on that graphite config change. [18:46:36] 10Operations, 10Ops-Access-Requests, 10Patch-For-Review: Requesting access to analytics servers for mepps - https://phabricator.wikimedia.org/T192472#4146737 (10mepps) Thanks @Dzahn! 
[18:49:36] marlier: welcome:) i saw the PITA comment [18:49:47] i ran puppet on graphite1001 but that was it [18:50:51] 10Operations, 10Ops-Access-Requests, 10Patch-For-Review: Requesting access to analytics servers for mepps - https://phabricator.wikimedia.org/T192472#4146745 (10Dzahn) a:03Dzahn [18:51:06] I saw that -- it all looks good. [18:51:24] great! [19:19:22] 10Operations, 10Ops-Access-Requests: add arlo and scott to parsoid releasers admin group - https://phabricator.wikimedia.org/T192684#4146838 (10Dzahn) p:05Triage>03Normal [19:23:01] (03PS1) 10Dzahn: admins: add arlolra, cscott to releasers-parsoid [puppet] - 10https://gerrit.wikimedia.org/r/427954 (https://phabricator.wikimedia.org/T192684) [19:32:31] (03PS1) 10Urbanecm: Temp rate limit for arwiki due to mass vandalism [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427956 (https://phabricator.wikimedia.org/T192668) [19:33:25] (03CR) 10jerkins-bot: [V: 04-1] Temp rate limit for arwiki due to mass vandalism [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427956 (https://phabricator.wikimedia.org/T192668) (owner: 10Urbanecm) [20:10:18] (03CR) 10Kaldari: "@Ladsgroup: Why does this need a cronjob? Isn't it just a 1-time script?" [puppet] - 10https://gerrit.wikimedia.org/r/424300 (https://phabricator.wikimedia.org/T189596) (owner: 10Ladsgroup) [20:26:37] (03PS1) 10Andrew Bogott: labvirt1021 and 1022: remove special eth interface juggling [puppet] - 10https://gerrit.wikimedia.org/r/428006 (https://phabricator.wikimedia.org/T192682) [20:29:13] 10Operations, 10Discovery, 10Wikidata, 10Wikidata-Query-Service, and 2 others: reimage wdqs1003 / wdqs200[123] with RAID - https://phabricator.wikimedia.org/T189192#4146959 (10Smalyshev) Great, thanks! 
[20:40:41] 10Operations, 10Beta-Cluster-Infrastructure, 10Release-Engineering-Team: Beta Cluster sends password reset mails with prod address - https://phabricator.wikimedia.org/T192686#4146985 (10Krinkle) [20:40:51] 10Operations, 10Beta-Cluster-Infrastructure, 10Release-Engineering-Team: Beta Cluster sends password reset mails with prod address - https://phabricator.wikimedia.org/T192686#4146974 (10Krinkle) [20:41:46] (03PS2) 10Andrew Bogott: labvirt1021 and 1022: remove special eth interface juggling [puppet] - 10https://gerrit.wikimedia.org/r/428006 (https://phabricator.wikimedia.org/T192682) [20:41:48] (03PS1) 10Andrew Bogott: labvirt1021 and 1022: Move back to Jessie [puppet] - 10https://gerrit.wikimedia.org/r/428007 [20:43:14] (03CR) 10Andrew Bogott: [C: 032] labvirt1021 and 1022: remove special eth interface juggling [puppet] - 10https://gerrit.wikimedia.org/r/428006 (https://phabricator.wikimedia.org/T192682) (owner: 10Andrew Bogott) [20:43:20] (03CR) 10Andrew Bogott: [C: 032] labvirt1021 and 1022: Move back to Jessie [puppet] - 10https://gerrit.wikimedia.org/r/428007 (owner: 10Andrew Bogott) [20:45:58] !log re-imaging labvirt1021 and 1022 as Jessie [20:46:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:56:02] (03CR) 10Rxy: Temp rate limit for arwiki due to mass vandalism (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427956 (https://phabricator.wikimedia.org/T192668) (owner: 10Urbanecm) [21:00:06] (03PS1) 10Thcipriani: Pipeline: setup minikube in CI [puppet] - 10https://gerrit.wikimedia.org/r/428010 (https://phabricator.wikimedia.org/T188936) [21:00:48] (03CR) 10jerkins-bot: [V: 04-1] Pipeline: setup minikube in CI [puppet] - 10https://gerrit.wikimedia.org/r/428010 (https://phabricator.wikimedia.org/T188936) (owner: 10Thcipriani) [21:01:21] (03PS2) 10Thcipriani: Pipeline: setup minikube in CI [puppet] - 10https://gerrit.wikimedia.org/r/428010 (https://phabricator.wikimedia.org/T188936) [21:10:45] 
(03PS1) 10Andrew Bogott: labvirt1021 and 1022: Move to Jessie, second attempt [puppet] - 10https://gerrit.wikimedia.org/r/428019 [21:12:27] (03PS2) 10Andrew Bogott: labvirt1021 and 1022: Move to Jessie, second attempt [puppet] - 10https://gerrit.wikimedia.org/r/428019 [21:13:40] (03CR) 10Andrew Bogott: [C: 032] labvirt1021 and 1022: Move to Jessie, second attempt [puppet] - 10https://gerrit.wikimedia.org/r/428019 (owner: 10Andrew Bogott) [21:18:15] 10Operations, 10Scap, 10Release-Engineering-Team (Kanban): mwscript rebuildLocalisationCache.php takes 40 minutes - https://phabricator.wikimedia.org/T191921#4147059 (10Krinkle) @thcipriani Is this with or without translation cache (TC) and JIT? [21:27:45] 10Operations, 10Scap, 10Release-Engineering-Team (Kanban): mwscript rebuildLocalisationCache.php takes 40 minutes - https://phabricator.wikimedia.org/T191921#4147070 (10thcipriani) This uses the value from `/etc/hhvm/php.ini`: `hhvm.jit = false`. Played with `-vEval.Jit=1` yesterday and it was quite a bit s... 
[21:56:06] (03CR) 10Dzahn: [C: 04-1] "might not be what was intended, needs clarification on ticket" [puppet] - 10https://gerrit.wikimedia.org/r/427833 (https://phabricator.wikimedia.org/T177850) (owner: 10Dzahn) [21:56:59] (03CR) 10Dzahn: [C: 04-1] "needs some communication effort to tell all the users of this module that there is a change coming" [puppet] - 10https://gerrit.wikimedia.org/r/415510 (owner: 10Dzahn) [21:58:08] (03PS1) 10Andrew Bogott: openstack: Pin a bunch of packages for mitaka/jessie [puppet] - 10https://gerrit.wikimedia.org/r/428023 (https://phabricator.wikimedia.org/T192162) [21:58:45] (03CR) 10Dzahn: "wasn't sure where what to compile this on but i see cache::misc has a superset director with thorium.eqiad.wmnet as a single backend" [puppet] - 10https://gerrit.wikimedia.org/r/416742 (owner: 10Dzahn) [21:59:21] (03CR) 10Andrew Bogott: [C: 032] openstack: Pin a bunch of packages for mitaka/jessie [puppet] - 10https://gerrit.wikimedia.org/r/428023 (https://phabricator.wikimedia.org/T192162) (owner: 10Andrew Bogott) [22:00:50] 10Puppet, 10Beta-Cluster-Infrastructure, 10MW-1.32-release-notes (WMF-deploy-2018-04-24 (1.32.0-wmf.1)), 10Patch-For-Review: deployment-prep has jobqueue issues - https://phabricator.wikimedia.org/T192473#4147110 (10EddieGP) [22:04:49] 10Operations, 10Beta-Cluster-Infrastructure, 10Release-Engineering-Team: Beta Cluster sends password reset mails with prod address - https://phabricator.wikimedia.org/T192686#4146974 (10MarcoAurelio) Maybe we should change it to `wiki@wikimedia.beta.wmflabs.org`. Does that address need to exist? 
[22:08:01] 10Operations, 10Beta-Cluster-Infrastructure, 10MediaWiki-Configuration, 10Release-Engineering-Team: Beta Cluster sends password reset mails with prod address - https://phabricator.wikimedia.org/T192686#4147125 (10Dzahn) [22:10:50] 10Operations, 10Beta-Cluster-Infrastructure, 10MediaWiki-Configuration, 10Release-Engineering-Team: Beta Cluster sends password reset mails with prod address - https://phabricator.wikimedia.org/T192686#4146974 (10Dzahn) >>! In T192686#4147116, @MarcoAurelio wrote: > Maybe we should change it to `wiki@wikim... [22:19:46] (03Draft1) 10MarcoAurelio: labs: use a $wgPasswordSender different from the production one [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428026 (https://phabricator.wikimedia.org/T192686) [22:19:50] (03PS2) 10MarcoAurelio: labs: use a $wgPasswordSender different from the production one [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428026 (https://phabricator.wikimedia.org/T192686) [22:22:19] (03PS1) 10Andrew Bogott: openstack compute: Fix monitoring of kvm processes on Debian [puppet] - 10https://gerrit.wikimedia.org/r/428027 [22:24:54] (03CR) 10Dzahn: [C: 031] "208.80.155.135" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428026 (https://phabricator.wikimedia.org/T192686) (owner: 10MarcoAurelio) [22:26:58] 10Operations, 10Beta-Cluster-Infrastructure, 10MediaWiki-Configuration, 10Release-Engineering-Team, 10Patch-For-Review: Beta Cluster sends password reset mails with prod address - https://phabricator.wikimedia.org/T192686#4147140 (10MarcoAurelio) With regards to my patch above, we should investigate if t... 
[22:29:40] (03CR) 10Andrew Bogott: [C: 032] openstack compute: Fix monitoring of kvm processes on Debian [puppet] - 10https://gerrit.wikimedia.org/r/428027 (owner: 10Andrew Bogott) [22:55:37] 10Operations, 10Scap, 10Release-Engineering-Team (Kanban): mwscript rebuildLocalisationCache.php takes 40 minutes - https://phabricator.wikimedia.org/T191921#4147196 (10Krinkle) @thcipriani Hm.. these are seconds though, as opposed to minutes. Is there something different about these commands? I'm asking bec... [22:58:07] 10Operations, 10Scap, 10Release-Engineering-Team (Kanban): mwscript rebuildLocalisationCache.php takes 40 minutes - https://phabricator.wikimedia.org/T191921#4147197 (10thcipriani) >>! In T191921#4147196, @Krinkle wrote: > @thcipriani Hm.. these are seconds though, as opposed to minutes. Is there something d... [22:59:13] (03CR) 10Smalyshev: [C: 031] Add cirrussearch settings for wikibase [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419367 (https://phabricator.wikimedia.org/T182717) (owner: 10DCausse) [22:59:25] (03PS3) 10Krinkle: labs: use a $wgPasswordSender different from the production one [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428026 (https://phabricator.wikimedia.org/T192686) (owner: 10MarcoAurelio) [22:59:59] (03CR) 10Krinkle: "Moved it nearby the other mail-related setting. There are three in total. 
The third one is wgEmergencyContact, but it seems that one is no" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428026 (https://phabricator.wikimedia.org/T192686) (owner: 10MarcoAurelio) [23:00:02] (03CR) 10Krinkle: [C: 031] labs: use a $wgPasswordSender different from the production one [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428026 (https://phabricator.wikimedia.org/T192686) (owner: 10MarcoAurelio) [23:33:07] PROBLEM - Disk space on labtestnet2001 is CRITICAL: DISK CRITICAL - free space: / 347 MB (3% inode=75%) [23:55:40] (03CR) 10Krinkle: [C: 032] labs: use a $wgPasswordSender different from the production one [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428026 (https://phabricator.wikimedia.org/T192686) (owner: 10MarcoAurelio) [23:56:57] (03Merged) 10jenkins-bot: labs: use a $wgPasswordSender different from the production one [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428026 (https://phabricator.wikimedia.org/T192686) (owner: 10MarcoAurelio) [23:58:54] (03PS1) 10Bstorm: wiki replicas: Depool labsdb1010 for MCR table additions [puppet] - 10https://gerrit.wikimedia.org/r/428037 (https://phabricator.wikimedia.org/T184446) [23:59:07] (03CR) 10jenkins-bot: labs: use a $wgPasswordSender different from the production one [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428026 (https://phabricator.wikimedia.org/T192686) (owner: 10MarcoAurelio) [23:59:59] Krinkle: may also want to check [[MediaWiki:Emailsender]] on the relevant wikis as well as that sets the from name for the email