[00:05:26] <wikibugs>	 10Operations, 10monitoring: Icinga check for ipv6 host reachability - https://phabricator.wikimedia.org/T163996#4144805 (10Dzahn) https://exchange.nagios.org/directory/Patches/Nagios-Core/IPv6-address-in-host-definition-patch/details
[00:12:42] <wikibugs>	 (03PS1) 10Dzahn: icinga: enable paging and set contact_group for grid engine checks [puppet] - 10https://gerrit.wikimedia.org/r/427833 (https://phabricator.wikimedia.org/T177850)
[00:32:23] <icinga-wm>	 PROBLEM - Disk space on labtestnet2001 is CRITICAL: DISK CRITICAL - free space: / 332 MB (3% inode=75%)
[01:44:10] <wikibugs>	 (03PS4) 10Krinkle: Remove obsolete $wgCentralPagePath CentralNotice global [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416618 (owner: 10AndyRussG)
[01:44:59] <wikibugs>	 (03CR) 10Krinkle: [C: 032] "Unused. Confirmed via code search and github @wikimedia." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416618 (owner: 10AndyRussG)
[01:46:21] <wikibugs>	 (03Merged) 10jenkins-bot: Remove obsolete $wgCentralPagePath CentralNotice global [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416618 (owner: 10AndyRussG)
[01:46:58] <Krinkle>	 Staging on mwdebug1002
[01:49:09] <wikibugs>	 (03CR) 10jenkins-bot: Remove obsolete $wgCentralPagePath CentralNotice global [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416618 (owner: 10AndyRussG)
[01:50:29] <logmsgbot>	 !log krinkle@tin Synchronized wmf-config/CommonSettings.php: If8fdce707d (duration: 01m 17s)
[01:50:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:15:32] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1227 is CRITICAL: CRITICAL - load average: 34.58, 33.33, 32.05
[03:16:02] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 224, down: 1, dormant: 0, excluded: 0, unused: 0
[03:19:32] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1227 is CRITICAL: CRITICAL - load average: 34.97, 33.03, 32.15
[03:22:33] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1227 is CRITICAL: CRITICAL - load average: 35.85, 32.85, 32.14
[03:38:52] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1231 is CRITICAL: CRITICAL - load average: 52.55, 21.03, 15.07
[03:40:52] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1231 is OK: OK - load average: 24.58, 22.25, 16.32
[05:12:52] <icinga-wm>	 PROBLEM - cassandra-b CQL 10.64.0.115:9042 on restbase1010 is CRITICAL: connect to address 10.64.0.115 and port 9042: Connection refused
[05:13:13] <icinga-wm>	 PROBLEM - cassandra-b SSL 10.64.0.115:7001 on restbase1010 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[05:19:41] <wikibugs>	 (03CR) 10ArielGlenn: [C: 032] keep intact output files from stubs/abstracts/logs around for retries [dumps] - 10https://gerrit.wikimedia.org/r/427684 (https://phabricator.wikimedia.org/T191177) (owner: 10ArielGlenn)
[05:20:55] <logmsgbot>	 !log ariel@tin Started deploy [dumps/dumps@c2d3bb4]: keep completed stubs/abstracts/logs files around for retries
[05:20:59] <logmsgbot>	 !log ariel@tin Finished deploy [dumps/dumps@c2d3bb4]: keep completed stubs/abstracts/logs files around for retries (duration: 00m 04s)
[05:21:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:21:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:22:15] <wikibugs>	 (03PS8) 10ArielGlenn: set actimeo=0 on snapshot1006 nfs mount for the next dump run [puppet] - 10https://gerrit.wikimedia.org/r/427603 (https://phabricator.wikimedia.org/T191177)
[05:23:13] <wikibugs>	 (03CR) 10ArielGlenn: [C: 032] set actimeo=0 on snapshot1006 nfs mount for the next dump run [puppet] - 10https://gerrit.wikimedia.org/r/427603 (https://phabricator.wikimedia.org/T191177) (owner: 10ArielGlenn)
[05:26:19] <wikibugs>	 (03PS1) 10Marostegui: db-eqiad.php: Depool db1110 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427844 (https://phabricator.wikimedia.org/T190148)
[05:27:44] <wikibugs>	 (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1110 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427844 (https://phabricator.wikimedia.org/T190148) (owner: 10Marostegui)
[05:28:57] <wikibugs>	 (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1110 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427844 (https://phabricator.wikimedia.org/T190148) (owner: 10Marostegui)
[05:29:15] <wikibugs>	 (03CR) 10jenkins-bot: db-eqiad.php: Depool db1110 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427844 (https://phabricator.wikimedia.org/T190148) (owner: 10Marostegui)
[05:30:34] <logmsgbot>	 !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1110 for alter table (duration: 01m 17s)
[05:30:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:31:02] <marostegui>	 !log Deploy schema change on db1110 - T191519 T188299 T190148
[05:31:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:31:09] <stashbot>	 T191519: Schema change for rc_namespace_title_timestamp index - https://phabricator.wikimedia.org/T191519
[05:31:09] <stashbot>	 T190148: Change DEFAULT 0 for rev_text_id on production DBs - https://phabricator.wikimedia.org/T190148
[05:31:09] <stashbot>	 T188299: Schema change for refactored actor storage - https://phabricator.wikimedia.org/T188299
[05:31:32] <marostegui>	 !log Start atop on db1114 with "-R" option enabled - T192551
[05:31:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:31:37] <stashbot>	 T192551: atop on stretch overloading a host - https://phabricator.wikimedia.org/T192551
[05:32:22] <wikibugs>	 10Operations, 10monitoring: atop on stretch overloading a host - https://phabricator.wikimedia.org/T192551#4145128 (10Marostegui) No errors running atop without "-R". I have just started it with "-R" to see if errors start showing up.
[05:32:22] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1226 is CRITICAL: CRITICAL - load average: 35.71, 33.17, 32.10
[05:42:28] <wikibugs>	 (03PS1) 10Marostegui: db-eqiad.php: Depool db1114 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427846
[05:55:47] <_joe_>	 !log depooling mw1227 from live traffic for investigation
[05:55:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:04:52] <icinga-wm>	 PROBLEM - Apache HTTP on mw1227 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[06:05:10] <wikibugs>	 (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1114 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427846 (owner: 10Marostegui)
[06:05:13] <icinga-wm>	 PROBLEM - HHVM rendering on mw1227 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[06:05:22] <icinga-wm>	 PROBLEM - Nginx local proxy to apache on mw1227 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[06:06:23] <wikibugs>	 (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1114 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427846 (owner: 10Marostegui)
[06:07:12] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1227 is OK: OK - load average: 0.21, 7.94, 23.31
[06:07:49] <logmsgbot>	 !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1114 (duration: 01m 16s)
[06:07:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:07:59] <marostegui>	 !log Stop mysql db1114 for a reboot
[06:08:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:09:24] <wikibugs>	 (03CR) 10jenkins-bot: db-eqiad.php: Depool db1114 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427846 (owner: 10Marostegui)
[06:11:43] <icinga-wm>	 RECOVERY - Check systemd state on db1114 is OK: OK - running: The system is fully operational
[06:16:51] <wikibugs>	 10Operations, 10monitoring: atop on stretch overloading a host - https://phabricator.wikimedia.org/T192551#4145151 (10Marostegui) As soon as it was started there was a spike of errors. So looks like -R is the offender here. {F17173512}  I have left atop started without -R and will leave it like that for the we...
[06:17:07] <wikibugs>	 10Operations, 10monitoring: atop on stretch overloading a host - https://phabricator.wikimedia.org/T192551#4145152 (10Marostegui) p:05Triage>03Normal
[06:17:27] <wikibugs>	 (03PS1) 10Marostegui: db-eqiad.php: Slowly repool db1114 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427848
[06:22:17] <wikibugs>	 (03CR) 10Marostegui: [C: 032] db-eqiad.php: Slowly repool db1114 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427848 (owner: 10Marostegui)
[06:23:30] <wikibugs>	 (03Merged) 10jenkins-bot: db-eqiad.php: Slowly repool db1114 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427848 (owner: 10Marostegui)
[06:23:45] <wikibugs>	 (03CR) 10jenkins-bot: db-eqiad.php: Slowly repool db1114 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427848 (owner: 10Marostegui)
[06:25:19] <logmsgbot>	 !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Slowly repool db1114 (duration: 01m 15s)
[06:25:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:26:57] <ema>	 !log kafka::analytics remove strongswan leftovers T185136
[06:27:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:27:04] <stashbot>	 T185136: Move webrequest varnishkafka and consumers to Kafka jumbo cluster. - https://phabricator.wikimedia.org/T185136
[06:32:12] <icinga-wm>	 RECOVERY - Disk space on labtestnet2001 is OK: DISK OK
[06:33:05] <wikibugs>	 (03PS1) 10Marostegui: db-eqiad.php: Fully repool db1114 in main traffic [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427850
[06:49:11] <wikibugs>	 (03PS2) 10Elukey: Set Debian Stretch as target OS for all the Hadoop worker nodes [puppet] - 10https://gerrit.wikimedia.org/r/427702 (https://phabricator.wikimedia.org/T192557)
[06:49:55] <wikibugs>	 (03CR) 10Marostegui: [C: 032] db-eqiad.php: Fully repool db1114 in main traffic [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427850 (owner: 10Marostegui)
[06:50:19] <wikibugs>	 (03CR) 10Elukey: [C: 032] Set Debian Stretch as target OS for all the Hadoop worker nodes [puppet] - 10https://gerrit.wikimedia.org/r/427702 (https://phabricator.wikimedia.org/T192557) (owner: 10Elukey)
[06:51:14] <wikibugs>	 (03Merged) 10jenkins-bot: db-eqiad.php: Fully repool db1114 in main traffic [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427850 (owner: 10Marostegui)
[06:51:29] <wikibugs>	 (03CR) 10jenkins-bot: db-eqiad.php: Fully repool db1114 in main traffic [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427850 (owner: 10Marostegui)
[06:54:11] <logmsgbot>	 !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Restore main traffic original weight for db1114 (duration: 01m 15s)
[06:54:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:57:31] <wikibugs>	 10Operations, 10Performance-Team, 10Patch-For-Review: /usr/local/bin/xenon-generate-svgs and flamegraph.pl cronspam - https://phabricator.wikimedia.org/T169249#4145172 (10Gilles) There's no leading space:  ``` gilles@mwlog1001:~$ sed -n '58775,58777p' /srv/xenon/logs/daily/2018-04-17.all.log  api.php;{GET};A...
[06:59:45] <wikibugs>	 (03CR) 10Gilles: Filter out invalid records in xenon-log (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/427816 (https://phabricator.wikimedia.org/T169249) (owner: 10Gilles)
[07:01:00] <wikibugs>	 10Puppet, 10Cloud-Services: Make changing puppetmasters for Labs instances more easy - https://phabricator.wikimedia.org/T152941#4145173 (10hashar) 05declined>03Open That is still broken. On project having a puppetmaster, any new instance ends up with a broken Puppet. The reason is firstboot.sh running pup...
[07:01:51] <wikibugs>	 (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1110" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427852
[07:01:54] <wikibugs>	 (03PS2) 10Marostegui: Revert "db-eqiad.php: Depool db1110" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427852
[07:03:36] <wikibugs>	 10Operations, 10monitoring, 10Patch-For-Review: Netbox: add Icinga check for PostgreSQL - https://phabricator.wikimedia.org/T185504#3917698 (10ema) We've had the following Icinga `UNKNOWN` on netmon2001 for the past 6 days:  ```Postgres Replication Lag - ERROR: FATAL: no pg_hba.conf entry for host "2620:0:86...
[07:04:00] <wikibugs>	 (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1110" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427852 (owner: 10Marostegui)
[07:05:14] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1110" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427852 (owner: 10Marostegui)
[07:05:33] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1226 is CRITICAL: CRITICAL - load average: 31.95, 31.23, 32.09
[07:06:52] <logmsgbot>	 !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1110 after alter table (duration: 01m 16s)
[07:06:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:07:30] <elukey>	 ema: esams 50x alarm incoming https://logstash.wikimedia.org/app/kibana#/dashboard/Varnish-Webrequest-50X
[07:08:15] <ema>	 elukey: thanks
[07:08:32] <vgutierrez>	 wow.. preemptive alarm service :P
[07:08:42] <wikibugs>	 (03PS1) 10Marostegui: db-eqiad.php: Slowly repool db1114 in API [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427854
[07:08:57] <moritzm>	 !log upgrading API servers in codfw to MEMC_VAL_COMPRESSION_ZLIB enabled HHVM build
[07:09:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:09:08] <ema>	 !log cp3032/cp3043: restart varnish-be due to mbox lag
[07:09:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:09:50] <wikibugs>	 (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1110" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427852 (owner: 10Marostegui)
[07:10:41] <wikibugs>	 (03CR) 10Marostegui: [C: 032] db-eqiad.php: Slowly repool db1114 in API [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427854 (owner: 10Marostegui)
[07:11:04] <ema>	 elukey: FTR https://gerrit.wikimedia.org/r/#/c/426858/ should bring these things back to an acceptable level
[07:11:33] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1226 is CRITICAL: CRITICAL - load average: 34.07, 32.18, 32.14
[07:11:33] <elukey>	 nice!
[07:11:53] <wikibugs>	 (03Merged) 10jenkins-bot: db-eqiad.php: Slowly repool db1114 in API [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427854 (owner: 10Marostegui)
[07:13:36] <logmsgbot>	 !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Slowly repool db1114 in API (duration: 01m 15s)
[07:13:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:14:02] <icinga-wm>	 PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5
[07:14:44] <ema>	 ^ mitigated w/ manual restarts https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=2&fullscreen&orgId=1&var-site=All&var-cache_type=All&var-status_type=5&from=now-30m&to=now
[07:15:12] <icinga-wm>	 PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[07:15:18] <wikibugs>	 10Operations, 10Performance-Team, 10Patch-For-Review: /usr/local/bin/xenon-generate-svgs and flamegraph.pl cronspam - https://phabricator.wikimedia.org/T169249#4145185 (10Gilles) I've confirmed that the PHP Redis client we use will incorrectly remove the leading space:  ``` gilles@terbium:~$ mwscript eval.ph...
[07:15:27] <wikibugs>	 (03CR) 10jenkins-bot: db-eqiad.php: Slowly repool db1114 in API [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427854 (owner: 10Marostegui)
[07:15:38] <wikibugs>	 (03Abandoned) 10Gilles: Filter out invalid records in xenon-log [puppet] - 10https://gerrit.wikimedia.org/r/427816 (https://phabricator.wikimedia.org/T169249) (owner: 10Gilles)
[07:15:53] <wikibugs>	 (03PS1) 10Marostegui: db-eqiad.php: Give more traffic to db1114 in API [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427855
[07:23:12] <icinga-wm>	 RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[07:24:02] <icinga-wm>	 RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5
[07:30:50] <_joe_>	 !log upgrading hhvm on all jobrunners in eqiad
[07:30:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:32:44] <ema>	 !log cp3030: restart varnish-be due to mbox lag
[07:32:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:36:03] <icinga-wm>	 PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5
[07:37:13] <icinga-wm>	 PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[07:37:56] <akosiaris>	 !log upgrade qemu on ganeti2006 to 1:2.8+dfsg-3~bpo8+1 and migrate mwdebug2001 to it T150532
[07:38:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:38:02] <stashbot>	 T150532: Upgrade qemu on ganeti clusters to 2.7 - https://phabricator.wikimedia.org/T150532
[07:38:03] <wikibugs>	 (03CR) 10Marostegui: [C: 032] db-eqiad.php: Give more traffic to db1114 in API [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427855 (owner: 10Marostegui)
[07:38:22] <ema>	 !log cp3041: restart varnish-be due to mbox lag
[07:38:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:39:21] <wikibugs>	 (03Merged) 10jenkins-bot: db-eqiad.php: Give more traffic to db1114 in API [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427855 (owner: 10Marostegui)
[07:39:28] <wikibugs>	 (03PS1) 10Jcrespo: mariadb: Depool db2071 for upgrade [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427861
[07:39:34] <wikibugs>	 (03CR) 10jenkins-bot: db-eqiad.php: Give more traffic to db1114 in API [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427855 (owner: 10Marostegui)
[07:40:58] <logmsgbot>	 !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Give more API traffic to db1114 (duration: 01m 17s)
[07:41:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:42:36] <wikibugs>	 (03CR) 10Jcrespo: [C: 032] mariadb: Depool db2071 for upgrade [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427861 (owner: 10Jcrespo)
[07:43:48] <wikibugs>	 (03Merged) 10jenkins-bot: mariadb: Depool db2071 for upgrade [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427861 (owner: 10Jcrespo)
[07:44:50] <wikibugs>	 (03CR) 10jenkins-bot: mariadb: Depool db2071 for upgrade [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427861 (owner: 10Jcrespo)
[07:48:36] <moritzm>	 !log upgrading app servers in codfw to MEMC_VAL_COMPRESSION_ZLIB enabled HHVM build
[07:48:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:48:50] <wikibugs>	 10Operations, 10Discovery, 10Wikidata, 10Wikidata-Query-Service, and 2 others: reimage wdqs1003 / wdqs200[123] with RAID - https://phabricator.wikimedia.org/T189192#4145201 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['wdqs1005.eqiad.wmnet']...
[07:52:22] <icinga-wm>	 RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[07:52:38] <logmsgbot>	 !log jynus@tin Synchronized wmf-config/db-codfw.php: Depool db2071 (duration: 01m 16s)
[07:52:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:53:12] <icinga-wm>	 RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5
[07:54:02] <wikibugs>	 10Operations, 10Performance-Team, 10Patch-For-Review: Move coal from graphite machine(s) - https://phabricator.wikimedia.org/T159354#4145203 (10fgiunchedi) Thanks @Imarlier for the explanation and insight! Makes sense to me, the other thing I suggest checking is coal's whisper files aggregation/retention peri...
[07:57:01] <jynus>	 !log starting reimage of db2071
[07:57:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:57:42] <icinga-wm>	 PROBLEM - DPKG on mw1303 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[07:58:37] <wikibugs>	 10Operations, 10Deployments, 10Patch-For-Review, 10Release, 10Release-Engineering-Team (Kanban): Deploy Scap 3.8.0 to production - https://phabricator.wikimedia.org/T192124#4145207 (10fgiunchedi) >>! In T192124#4144670, @demon wrote: > That was part of that commit. I was kinda following the example set b...
[07:58:41] <icinga-wm>	 RECOVERY - DPKG on mw1303 is OK: All packages OK
[07:59:04] <wikibugs>	 (03PS1) 10Marostegui: db-eqiad.php: Fully repool db1114 in API [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427868
[08:02:12] <wikibugs>	 (03CR) 10Marostegui: [C: 032] db-eqiad.php: Fully repool db1114 in API [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427868 (owner: 10Marostegui)
[08:03:18] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 031] Target kafka jmx exporters by profiles instead of roles [puppet] - 10https://gerrit.wikimedia.org/r/427672 (owner: 10Ottomata)
[08:03:32] <wikibugs>	 (03Merged) 10jenkins-bot: db-eqiad.php: Fully repool db1114 in API [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427868 (owner: 10Marostegui)
[08:05:30] <Hauskatze>	 Hi. Any op around for a quick script run (initSiteStats)?
[08:05:31] <logmsgbot>	 !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Fully repool db1114 in API - T191996 (duration: 01m 16s)
[08:05:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:05:37] <stashbot>	 T191996: db1114 connection issues - https://phabricator.wikimedia.org/T191996
[08:08:47] <icinga-wm>	 PROBLEM - DPKG on mw1310 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[08:08:48] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1232 is CRITICAL: CRITICAL - load average: 36.99, 34.28, 32.07
[08:09:11] <wikibugs>	 (03CR) 10jenkins-bot: db-eqiad.php: Fully repool db1114 in API [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427868 (owner: 10Marostegui)
[08:09:47] <icinga-wm>	 RECOVERY - DPKG on mw1310 is OK: All packages OK
[08:11:57] <icinga-wm>	 PROBLEM - HHVM jobrunner on mw1311 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time
[08:12:57] <icinga-wm>	 RECOVERY - HHVM jobrunner on mw1311 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.006 second response time
[08:13:16] <wikibugs>	 10Operations, 10Discovery, 10Wikidata, 10Wikidata-Query-Service, and 2 others: reimage wdqs1003 / wdqs200[123] with RAID - https://phabricator.wikimedia.org/T189192#4145228 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['wdqs1005.eqiad.wmnet'] ```  and were **ALL** successful.
[08:20:00] <icinga-wm>	 PROBLEM - HHVM jobrunner on mw1306 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time
[08:20:37] <wikibugs>	 10Operations, 10Performance-Team, 10Patch-For-Review: /usr/local/bin/xenon-generate-svgs and flamegraph.pl cronspam - https://phabricator.wikimedia.org/T169249#4145252 (10Gilles) Found the explanation... the "1" I was getting from the python client wasn't the message I sent (in my test above I had sent it be...
[08:21:01] <icinga-wm>	 RECOVERY - HHVM jobrunner on mw1306 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.001 second response time
[08:22:54] <wikibugs>	 (03PS2) 10Elukey: Changes needed for upgrading to Druid 0.10 [puppet] - 10https://gerrit.wikimedia.org/r/355471 (https://phabricator.wikimedia.org/T164008) (owner: 10Ottomata)
[08:23:03] <wikibugs>	 (03PS1) 10Gilles: Ignore Redis subscription message in xenon-log [puppet] - 10https://gerrit.wikimedia.org/r/427870 (https://phabricator.wikimedia.org/T169249)
[08:23:20] <icinga-wm>	 PROBLEM - HHVM jobrunner on mw1304 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.003 second response time
[08:24:20] <icinga-wm>	 RECOVERY - HHVM jobrunner on mw1304 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.007 second response time
[08:25:57] <wikibugs>	 (03PS2) 10Gilles: Ignore Redis subscription message in xenon-log [puppet] - 10https://gerrit.wikimedia.org/r/427870 (https://phabricator.wikimedia.org/T169249)
[08:27:10] <icinga-wm>	 PROBLEM - DPKG on mw1309 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[08:28:16] <icinga-wm>	 RECOVERY - DPKG on mw1309 is OK: All packages OK
[08:29:33] <wikibugs>	 10Operations, 10Performance-Team, 10Patch-For-Review, 10Security: /usr/local/bin/xenon-generate-svgs and flamegraph.pl cronspam - https://phabricator.wikimedia.org/T169249#4145288 (10Gilles)
[08:29:56] <wikibugs>	 10Operations, 10Performance-Team, 10Patch-For-Review, 10Security: /usr/local/bin/xenon-generate-svgs and flamegraph.pl cronspam - https://phabricator.wikimedia.org/T169249#4145291 (10Gilles)
[08:37:58] <wikibugs>	 (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler03/10990/" [puppet] - 10https://gerrit.wikimedia.org/r/355471 (https://phabricator.wikimedia.org/T164008) (owner: 10Ottomata)
[08:39:35] <elukey>	 !log restart hhvm on mw[1226,1232].eqiad.wmnet - high load
[08:39:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:39:49] <marostegui>	 !log Going to sanitize gorwiki euwikisource romdwikimedia inhwiki on db1095 - T189112 T189466 T187774 T184375
[08:39:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:39:59] <stashbot>	 T184375: Prepare and check storage layer for inhwiki - https://phabricator.wikimedia.org/T184375
[08:39:59] <stashbot>	 T189466: Prepare storage layer for  euwikisource - https://phabricator.wikimedia.org/T189466
[08:39:59] <stashbot>	 T187774: Prepare and check storage layer for romdwikimedia - https://phabricator.wikimedia.org/T187774
[08:39:59] <stashbot>	 T189112: Prepare and check storage layer for gorwiki - https://phabricator.wikimedia.org/T189112
[08:41:46] <moritzm>	 !log upgrading job runners in codfw to MEMC_VAL_COMPRESSION_ZLIB enabled HHVM build
[08:41:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:47:36] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1232 is OK: OK - load average: 7.73, 15.19, 23.90
[08:49:16] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1226 is OK: OK - load average: 8.24, 13.06, 23.69
[09:02:40] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 032] Ignore Redis subscription message in xenon-log [puppet] - 10https://gerrit.wikimedia.org/r/427870 (https://phabricator.wikimedia.org/T169249) (owner: 10Gilles)
[09:02:45] <wikibugs>	 (03PS3) 10Filippo Giunchedi: Ignore Redis subscription message in xenon-log [puppet] - 10https://gerrit.wikimedia.org/r/427870 (https://phabricator.wikimedia.org/T169249) (owner: 10Gilles)
[09:06:15] <moritzm>	 !log upgrading video scalers in codfw to MEMC_VAL_COMPRESSION_ZLIB enabled HHVM build
[09:06:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:09:38] <wikibugs>	 (03CR) 10Filippo Giunchedi: "I'll check back on cronspam from xenon-log" [puppet] - 10https://gerrit.wikimedia.org/r/427870 (https://phabricator.wikimedia.org/T169249) (owner: 10Gilles)
[09:12:15] <godog>	 volunteers for https://gerrit.wikimedia.org/r/c/427619/ ?
[09:12:19] <elukey>	 !log restart of mw apis showing ~50% cpu utilization as precaution before the weekend - mw[1224,1225,1228,1230,1231,1233-1235,1276-1283,1286,1312,1313,1315,1316,1341,1343,1344,1347,1348]*
[09:12:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:12:53] <elukey>	 moritzm: --^ (-b 2)
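The host list in the restart log entry above uses a bracketed range expression (`mw[1224,1225,...,1233-1235,...]`). As a rough illustration of how such an expression maps to individual hosts, here is a minimal Python sketch. The function name `expand_hosts` is hypothetical, it handles only a single bracket group, and it is not the tool that was actually used for the restart.

```python
import re


def expand_hosts(expr: str) -> list[str]:
    """Expand 'prefix[a,b,c-d]suffix' into individual host names.

    Hypothetical helper: supports exactly one bracket group.
    """
    match = re.fullmatch(r"([^\[\]]*)\[([^\[\]]+)\]([^\[\]]*)", expr)
    if match is None:
        return [expr]  # no bracket group: treat as a single host
    prefix, body, suffix = match.groups()
    hosts = []
    for part in body.split(","):
        # "1233-1235" is an inclusive range, "1286" is a single number
        start, _, end = part.partition("-")
        end = end or start
        width = len(start)  # keep any zero padding from the expression
        for n in range(int(start), int(end) + 1):
            hosts.append(f"{prefix}{n:0{width}d}{suffix}")
    return hosts


if __name__ == "__main__":
    for host in expand_hosts("mw[1226,1232].eqiad.wmnet"):
        print(host)
```

Under these assumptions, the expression from the 08:39:35 log entry expands to two hosts, and the one from the 09:12:19 entry (without the trailing `*`) expands to 26.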
[09:13:24] <eddiegp>	 Hauskatze: For what wiki?
[09:13:37] <eddiegp>	 It ran on all wikis last Sunday (we now have a cron for that).
[09:13:59] <Hauskatze>	 for the newly created ones, but if there's a cron I'll let folks know on the task
[09:14:12] <Hauskatze>	 they complain that the article counts are still at 0 while everything has been imported
[09:14:23] <Hauskatze>	 not sure if we need an initial manual run though
[09:14:59] <eddiegp>	 No, it shouldn't need one. It should update automatically. That said, this didn't work that great in the past (hence the cron).
[09:15:53] <eddiegp>	 Hauskatze: The next cron run will be on 30th iirc. Up to you whether to wait that long. If not, I'd just sign it up for SWAT, should be easy enough.
[09:17:07] <Hauskatze>	 not my wikis, but maybe I'll add a note on the swat page for a dev to do that in their spare time
[09:18:23] <wikibugs>	 (03PS1) 10Jcrespo: Revert "mariadb: Depool db2071 for upgrade" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427877
[09:19:46] <icinga-wm>	 PROBLEM - Disk space on labtestvirt2001 is CRITICAL: DISK CRITICAL - /home/aborrero/mnt is not accessible: Permission denied
[09:20:38] <Hauskatze>	 arturo: ^
[09:22:12] <wikibugs>	 (03PS2) 10Jcrespo: Revert "mariadb: Depool db2071 for upgrade" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427877
[09:22:14] <wikibugs>	 (03PS1) 10Jcrespo: mariadb: Depool db2070 for upgrade [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427878
[09:22:29] <wikibugs>	 (03CR) 10jenkins-bot: [V: 04-1] mariadb: Depool db2070 for upgrade [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427878 (owner: 10Jcrespo)
[09:22:32] <wikibugs>	 (03CR) 10Jcrespo: [C: 032] Revert "mariadb: Depool db2071 for upgrade" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427877 (owner: 10Jcrespo)
[09:22:37] <arturo>	 I don't fully understand the purpose of that check
[09:22:46] <icinga-wm>	 RECOVERY - Disk space on labtestvirt2001 is OK: DISK OK
[09:23:54] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "mariadb: Depool db2071 for upgrade" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427877 (owner: 10Jcrespo)
[09:25:06] <wikibugs>	 (03PS2) 10Jcrespo: mariadb: Depool db2070 for upgrade [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427878
[09:26:39] <_joe_>	 arturo: a check on free disk space?
[09:27:12] <arturo>	 you don't need concrete directory permissions for that
[09:27:16] <wikibugs>	 (03CR) 10Jcrespo: [C: 032] mariadb: Depool db2070 for upgrade [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427878 (owner: 10Jcrespo)
[09:27:36] <_joe_>	 oh so you don't understand why it's failing
[09:28:04] <arturo>	 I don't understand why icinga cares about a directory created by root in my home directory :-P
[09:28:28] <wikibugs>	 (03Merged) 10jenkins-bot: mariadb: Depool db2070 for upgrade [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427878 (owner: 10Jcrespo)
[09:29:16] <wikibugs>	 (03CR) 10jenkins-bot: Revert "mariadb: Depool db2071 for upgrade" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427877 (owner: 10Jcrespo)
[09:32:18] <wikibugs>	 (03PS1) 10Jcrespo: Revert "mariadb: Depool db2070 for upgrade" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427879
[09:32:54] <wikibugs>	 (03PS1) 10Jcrespo: Revert "install_server: Allow stretch reimage of db207* except db2079" [puppet] - 10https://gerrit.wikimedia.org/r/427880
[09:33:23] <logmsgbot>	 !log jynus@tin Synchronized wmf-config/db-codfw.php: Repool db2071, depool db2070 (duration: 01m 16s)
[09:33:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:34:35] <eddiegp>	 You need r-x for a directory to get the size of its contents, at least with `du`.
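
A minimal sketch of eddiegp's point, assuming a plain POSIX permission model: sizing a directory's contents needs read (to list entries) and execute (to stat what is inside). The helper names `can_measure` and `dir_size` are hypothetical, and this is not the actual Icinga disk check, which is not shown in this log.

```python
import os


def can_measure(path):
    """Return True if the current user may list (r) and traverse (x) `path`."""
    return os.access(path, os.R_OK | os.X_OK)


def dir_size(path):
    """Sum the apparent sizes of files under `path`.

    Directories that cannot be read are skipped silently, so the total
    can be an undercount when permissions are missing.
    """
    total = 0
    for root, _dirs, files in os.walk(path, onerror=lambda err: None):
        for name in files:
            try:
                total += os.lstat(os.path.join(root, name)).st_size
            except OSError:
                # File vanished or is not stat-able; ignore it.
                pass
    return total
```

Calling `can_measure("/home/aborrero/mnt")` as an unprivileged user would be one way to reproduce the "Permission denied" condition from the alert; root bypasses these checks.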
[09:35:42] <wikibugs>	 10Operations, 10monitoring: prometheus on bast3002 misbehaving - https://phabricator.wikimedia.org/T192610#4145427 (10fgiunchedi) Thanks @bblack for taking a look! Looks like an "heavy" query from `varnish-failed-fetches` drove disk utilization to 100% starting at 04/19T22:17 and things snowballed from there,...
[09:35:57] <wikibugs>	 10Operations, 10monitoring, 10User-fgiunchedi: prometheus on bast3002 misbehaving - https://phabricator.wikimedia.org/T192610#4145430 (10fgiunchedi)
[09:36:31] <wikibugs>	 (03PS1) 10Jcrespo: install_server: Revert patch allowing reimage of db207* hosts [puppet] - 10https://gerrit.wikimedia.org/r/427881
[09:36:38] <wikibugs>	 (03Abandoned) 10Jcrespo: Revert "install_server: Allow stretch reimage of db207* except db2079" [puppet] - 10https://gerrit.wikimedia.org/r/427880 (owner: 10Jcrespo)
[09:39:22] <wikibugs>	 10Operations, 10Deployments, 10Patch-For-Review, 10Release, 10Release-Engineering-Team (Kanban): Deploy Scap 3.8.0 to production - https://phabricator.wikimedia.org/T192124#4145433 (10mmodell)
[09:39:56] <wikibugs>	 10Operations, 10Deployments, 10Patch-For-Review, 10Release, 10Release-Engineering-Team (Kanban): Deploy Scap 3.8.0 to production - https://phabricator.wikimedia.org/T192124#4128785 (10mmodell) >>! In T192124#4145207, @fgiunchedi wrote: > I don't know about conftool but AFAICS scap doesn't contain archite...
[09:41:07] <wikibugs>	 (03Abandoned) 10Fdans: Puppetize cron job archiving old MaxMind databases [puppet] - 10https://gerrit.wikimedia.org/r/425247 (https://phabricator.wikimedia.org/T136732) (owner: 10Fdans)
[09:41:49] <moritzm>	 !log upgrading mwdebug servers to MEMC_VAL_COMPRESSION_ZLIB enabled HHVM build
[09:41:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:41:59] <jynus>	 !log starting reimage of db2070
[09:42:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:44:43] <wikibugs>	 (03CR) 10Volans: "> That requires the cumin master to use Stretch. On Jessie there are" [puppet] - 10https://gerrit.wikimedia.org/r/419131 (https://phabricator.wikimedia.org/T188112) (owner: 10Volans)
[09:44:57] <wikibugs>	 10Operations, 10monitoring, 10Patch-For-Review: Netbox: add Icinga check for PostgreSQL - https://phabricator.wikimedia.org/T185504#4145439 (10Gehel) >>! In T185504#4145176, @ema wrote: > It's also unclear to me whether `UNKNOWN` is the proper severity for this issue, it should perhaps be `CRITICAL` instead....
[09:45:36] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 226, down: 0, dormant: 0, excluded: 0, unused: 0
[09:51:10] <moritzm>	 !log upgrading deployment servers to MEMC_VAL_COMPRESSION_ZLIB enabled HHVM build
[09:51:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:53:09] <wikibugs>	 (03PS2) 10Filippo Giunchedi: Bump scap version to 3.8.0-2 [puppet] - 10https://gerrit.wikimedia.org/r/427535 (https://phabricator.wikimedia.org/T192124) (owner: 1020after4)
[09:58:28] <godog>	 !log upload scap 3.8.0-2 - T192124
[09:58:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:58:34] <stashbot>	 T192124: Deploy Scap 3.8.0 to production - https://phabricator.wikimedia.org/T192124
[09:59:03] <wikibugs>	 (03CR) 10Jcrespo: [C: 032] install_server: Revert patch allowing reimage of db207* hosts [puppet] - 10https://gerrit.wikimedia.org/r/427881 (owner: 10Jcrespo)
[09:59:41] <wikibugs>	 (03PS3) 10Filippo Giunchedi: Bump scap version to 3.8.0-2 [puppet] - 10https://gerrit.wikimedia.org/r/427535 (https://phabricator.wikimedia.org/T192124) (owner: 1020after4)
[09:59:53] <godog>	 twentyafterfour: scap upgraded on tin already, I'll merge the puppet patch
[10:00:09] <twentyafterfour>	 godog: thanks!
[10:00:55] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 032] Bump scap version to 3.8.0-2 [puppet] - 10https://gerrit.wikimedia.org/r/427535 (https://phabricator.wikimedia.org/T192124) (owner: 1020after4)
[10:01:42] <godog>	 jynus: merging your change too
[10:02:02] <jynus>	 thanks
[10:03:52] <wikibugs>	 10Operations, 10Performance-Team, 10Patch-For-Review: /usr/local/bin/xenon-generate-svgs and flamegraph.pl cronspam - https://phabricator.wikimedia.org/T169249#4145475 (10Aklapper)
[10:17:50] <wikibugs>	 (03PS1) 10Muehlenhoff: Enable base::service_auto_restart for prometheus-nutcracker-exporter [puppet] - 10https://gerrit.wikimedia.org/r/427884 (https://phabricator.wikimedia.org/T135991)
[10:20:00] <wikibugs>	 (03PS3) 10Muehlenhoff: Enable base::service_auto_restart for prometheus-snmp-exporter [puppet] - 10https://gerrit.wikimedia.org/r/424243 (https://phabricator.wikimedia.org/T135991)
[10:20:55] <wikibugs>	 10Operations, 10Performance-Team, 10Patch-For-Review: /usr/local/bin/xenon-generate-svgs and flamegraph.pl cronspam - https://phabricator.wikimedia.org/T169249#4145531 (10fgiunchedi) So far so good, the restart didn't cause any cronspam this time around
[10:31:04] <wikibugs>	 10Operations, 10Discovery, 10Wikidata, 10Wikidata-Query-Service, and 2 others: reimage wdqs1003 / wdqs200[123] with RAID - https://phabricator.wikimedia.org/T189192#4145535 (10Gehel) All wdqs servers are now running RAID on Debian Stretch. Data is fully reloaded.
[10:46:58] <wikibugs>	 (03PS1) 10Muehlenhoff: Enable base::service_auto_restart for cassandra-metrics-collector [puppet] - 10https://gerrit.wikimedia.org/r/427889 (https://phabricator.wikimedia.org/T135991)
[10:56:11] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 031] Enable base::service_auto_restart for cassandra-metrics-collector [puppet] - 10https://gerrit.wikimedia.org/r/427889 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff)
[10:56:26] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 031] Enable base::service_auto_restart for prometheus-nutcracker-exporter [puppet] - 10https://gerrit.wikimedia.org/r/427884 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff)
[10:56:37] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 031] Enable base::service_auto_restart for prometheus-snmp-exporter [puppet] - 10https://gerrit.wikimedia.org/r/424243 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff)
[11:06:55] <moritzm>	 !log installing tiff security updates on trusty
[11:07:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:27:47] <elukey>	 !log reimage analytics1068 to Debian Stretch - T192557
[11:27:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:27:53] <stashbot>	 T192557: Reimage the Debian Jessie Analytics worker nodes to Stretch. - https://phabricator.wikimedia.org/T192557
[11:53:11] <moritzm>	 !log installing apache security updates on netmon1002/2001
[11:53:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:54:57] <wikibugs>	 (03CR) 10Jcrespo: [C: 032] Revert "mariadb: Depool db2070 for upgrade" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427879 (owner: 10Jcrespo)
[11:56:38] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "mariadb: Depool db2070 for upgrade" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427879 (owner: 10Jcrespo)
[12:05:31] <moritzm>	 !log upgrading apache on einsteinium/icinga.wikimedia.org
[12:05:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:06:11] <moritzm>	 ^ completed, icinga back to normal
[12:06:35] <moritzm>	 !log installing apache security updates on video scalers
[12:06:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:15:12] <jynus>	 I used git as root by error on tin, I think I cleaned up permissions after that, but ping me if you see something weird
[12:15:49] <logmsgbot>	 !log jynus@tin Synchronized wmf-config/db-codfw.php: Repool db2070 (duration: 01m 17s)
[12:15:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:18:08] <jynus>	 !log upgrading and restarting dbstore2002
[12:18:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:25:48] <moritzm>	 !log upgrading apache on auth* servers
[12:25:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:00:20] <wikibugs>	 (03PS1) 10Jcrespo: mariadb: Create /run/mysqld on server start with tmpfiles.d [puppet] - 10https://gerrit.wikimedia.org/r/427902
[13:00:35] <moritzm>	 !log installing zsh security updates on trusty servers
[13:00:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:15:37] <wikibugs>	 (03PS1) 10Jcrespo: mariadb: Do not create /srv/sqldata and /srv/tmp if datadir is false [puppet] - 10https://gerrit.wikimedia.org/r/427904
[13:16:18] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] mariadb: Do not create /srv/sqldata and /srv/tmp if datadir is false [puppet] - 10https://gerrit.wikimedia.org/r/427904 (owner: 10Jcrespo)
[13:25:00] <moritzm>	 !log upgrading mysql (as shipped in Debian) on bohrium
[13:25:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:26:10] <Reedy>	 Dereckson: I've marked https://phabricator.wikimedia.org/T171013 as UBN to get Wikidata teams attention to get the mess fixed up
[13:27:03] <hoo>	 Reedy: Will have a look
[13:27:15] <Reedy>	 hoo: I'm sure you're aware of it when you look :P
[13:27:15] <hoo>	 I thought I already documented that on Wikitech like ages ago
[13:27:32] <Reedy>	 Every time we create a wiki... We get to this stage of stuff still not working right
[13:27:42] <Reedy>	 So there's either stuff missing... Or the docs are completely wrong :)
[13:27:56] <hoo>	 I have a meeting now, but will have a look after (so surely today)
[13:28:24] <Reedy>	 Thanks :)
[13:32:14] <wikibugs>	 (03PS1) 10Jcrespo: mariadb: Depool db2086 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427905
[13:32:39] <anomie>	 jynus, marostegui: Any objection to me running the CREATE TABLE to recreate the `slots` table now? (following up T190153, and T184446#4143097)
[13:32:39] <stashbot>	 T184446: Configure Toolforge replica views and dumps for the new MCR tables - https://phabricator.wikimedia.org/T184446
[13:32:39] <stashbot>	 T190153: DROP unused 'slots' table (WAS: In the slots table, replace slot_inherited with slot_origin) - https://phabricator.wikimedia.org/T190153
[13:34:09] <jynus>	 anomie: normally we do not like friday deploys, but we can make an exception
[13:34:39] <anomie>	 I can wait if you'd rather. 
[13:34:47] <jynus>	 it is ok, I am around
[13:35:01] <jynus>	 better now than in 2 hours, when I will not be around :-)
[13:35:29] <anomie>	 !log (re-)creating `slots` table on all wikis, following up T190153 and T184446#4143097
[13:35:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:37:49] <jynus>	 normally those are totally safe, but there is a small chance of one of those not being deleted and breaking replication
[13:38:03] <jynus>	 *having been deleted
[13:44:04] <wikibugs>	 (03PS2) 10Jcrespo: mariadb: Create /run/mysqld on server start with tmpfiles.d [puppet] - 10https://gerrit.wikimedia.org/r/427902
[13:44:06] <wikibugs>	 (03PS2) 10Jcrespo: mariadb: Do not create /srv/sqldata and /srv/tmp if datadir is false [puppet] - 10https://gerrit.wikimedia.org/r/427904
[13:45:19] <anomie>	 jynus: My run of creations is done now, FYI.
[13:45:26] <jynus>	 thanks
[13:55:33] <wikibugs>	 (03CR) 10Jcrespo: [C: 032] mariadb: Depool db2086 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427905 (owner: 10Jcrespo)
[13:56:48] <wikibugs>	 (03Merged) 10jenkins-bot: mariadb: Depool db2086 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427905 (owner: 10Jcrespo)
[13:59:08] <logmsgbot>	 !log jynus@tin Synchronized wmf-config/db-codfw.php: Depool db2086 (duration: 01m 13s)
[13:59:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:00:29] <jynus>	 !log upgrade and restart db2086
[14:00:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:02:08] <moritzm>	 !log upgrading labweb* servers to MEMC_VAL_COMPRESSION_ZLIB enabled HHVM build
[14:02:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:02:42] <wikibugs>	 (03PS2) 10Andrew Bogott: Wikitech: change maintenance jobs to use the 'wikitech' dblist [puppet] - 10https://gerrit.wikimedia.org/r/427812 (https://phabricator.wikimedia.org/T189542)
[14:03:50] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 032] Wikitech: change maintenance jobs to use the 'wikitech' dblist [puppet] - 10https://gerrit.wikimedia.org/r/427812 (https://phabricator.wikimedia.org/T189542) (owner: 10Andrew Bogott)
[14:08:46] <wikibugs>	 (03PS1) 10Andrew Bogott: Remove silver.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427910 (https://phabricator.wikimedia.org/T189542)
[14:13:22] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 032] Remove silver.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427910 (https://phabricator.wikimedia.org/T189542) (owner: 10Andrew Bogott)
[14:14:44] <wikibugs>	 (03Merged) 10jenkins-bot: Remove silver.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427910 (https://phabricator.wikimedia.org/T189542) (owner: 10Andrew Bogott)
[14:16:52] <logmsgbot>	 !log andrew@tin Synchronized dblists: Purging obsolete silver.dblist (duration: 01m 17s)
[14:17:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:33:16] <wikibugs>	 (03PS1) 10Jcrespo: mariadb: Repool db2086, depool db2087 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427912
[14:37:17] <wikibugs>	 (03CR) 10Jcrespo: [C: 032] mariadb: Repool db2086, depool db2087 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427912 (owner: 10Jcrespo)
[14:39:04] <wikibugs>	 (03PS1) 10Jcrespo: mariadb: Repool db2087 after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427913
[14:41:50] <logmsgbot>	 !log jynus@tin Synchronized wmf-config/db-codfw.php: Repool db2086, depool db2087 (duration: 01m 16s)
[14:41:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:42:44] <icinga-wm>	 ACKNOWLEDGEMENT - cassandra-b CQL 10.64.0.115:9042 on restbase1010 is CRITICAL: connect to address 10.64.0.115 and port 9042: Connection refused eevans Decommissioned
[14:42:44] <icinga-wm>	 ACKNOWLEDGEMENT - cassandra-b SSL 10.64.0.115:7001 on restbase1010 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused eevans Decommissioned
[14:47:00] <wikibugs>	 (03PS1) 10Andrew Bogott: Add 'wikitech' section for wikitech db hosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427915 (https://phabricator.wikimedia.org/T189542)
[14:55:53] <wikibugs>	 (03CR) 10Jcrespo: "Let's deploy better next week- so we do not accidentally break the other wikis." (035 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427915 (https://phabricator.wikimedia.org/T189542) (owner: 10Andrew Bogott)
[14:55:56] <wikibugs>	 10Operations, 10monitoring, 10Graphite, 10User-fgiunchedi: Programmatic generation of grafana dashboards - https://phabricator.wikimedia.org/T171482#4146092 (10fgiunchedi)
[14:57:12] <wikibugs>	 (03PS1) 10Muehlenhoff: Enable base::service_auto_restart for clamav-freshclam [puppet] - 10https://gerrit.wikimedia.org/r/427916 (https://phabricator.wikimedia.org/T135991)
[15:03:28] <wikibugs>	 (03CR) 10Jcrespo: [C: 032] mariadb: Repool db2087 after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427913 (owner: 10Jcrespo)
[15:04:52] <wikibugs>	 (03Merged) 10jenkins-bot: mariadb: Repool db2087 after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427913 (owner: 10Jcrespo)
[15:06:27] <wikibugs>	 (03CR) 10jenkins-bot: mariadb: Repool db2086, depool db2087 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427912 (owner: 10Jcrespo)
[15:07:03] <logmsgbot>	 !log jynus@tin Synchronized wmf-config/db-codfw.php: Repool db2087 (duration: 01m 16s)
[15:07:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:10:26] <wikibugs>	 (03CR) 10Andrew Bogott: "So... you want both eqiad and codfw sites to point to the codfw db server, and mark eqiad as read-only?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427915 (https://phabricator.wikimedia.org/T189542) (owner: 10Andrew Bogott)
[15:12:48] <wikibugs>	 (03CR) 10Jcrespo: "No, eqiad should point to eqiad, and codfw to codfw, and codfw should be set as read only." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427915 (https://phabricator.wikimedia.org/T189542) (owner: 10Andrew Bogott)
[15:13:08] <wikibugs>	 (03CR) 10jenkins-bot: Revert "mariadb: Depool db2070 for upgrade" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427879 (owner: 10Jcrespo)
[15:13:12] <wikibugs>	 (03CR) 10jenkins-bot: mariadb: Depool db2070 for upgrade [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427878 (owner: 10Jcrespo)
[15:13:18] <wikibugs>	 (03CR) 10jenkins-bot: mariadb: Depool db2086 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427905 (owner: 10Jcrespo)
[15:13:22] <wikibugs>	 (03CR) 10jenkins-bot: Remove silver.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427910 (https://phabricator.wikimedia.org/T189542) (owner: 10Andrew Bogott)
[15:13:27] <wikibugs>	 (03CR) 10jenkins-bot: mariadb: Repool db2087 after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427913 (owner: 10Jcrespo)
[15:19:36] <wikibugs>	 (03CR) 10Eevans: [C: 031] Enable base::service_auto_restart for cassandra-metrics-collector [puppet] - 10https://gerrit.wikimedia.org/r/427889 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff)
[15:37:01] <wikibugs>	 (03PS1) 10Jcrespo: mariadb: Update mysql 8.0 package [software] - 10https://gerrit.wikimedia.org/r/427926
[16:16:14] <wikibugs>	 (03PS1) 10Herron: WIP: puppetmaster: remove support for puppetdb 2.x [puppet] - 10https://gerrit.wikimedia.org/r/427928 (https://phabricator.wikimedia.org/T190318)
[16:16:51] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] WIP: puppetmaster: remove support for puppetdb 2.x [puppet] - 10https://gerrit.wikimedia.org/r/427928 (https://phabricator.wikimedia.org/T190318) (owner: 10Herron)
[16:19:51] <wikibugs>	 (03PS2) 10Herron: WIP: puppetmaster: remove support for puppetdb 2.x [puppet] - 10https://gerrit.wikimedia.org/r/427928 (https://phabricator.wikimedia.org/T190318)
[16:25:07] <dcausse>	 I'll be deploying a fix on group0 wikis for CirrusSearch
[16:26:24] <thcipriani>	 ^ blessed by releng, FYI
[16:26:51] <greg-g>	 reminder: group0 is testwikis + mw.org, so our "safe" group
[16:39:38] <mutante>	 subbu: feel like testing the upload of a test file to the parsoid release archive ?
[16:40:02] <subbu>	 mutante, sure. 
[16:40:10] <subbu>	 let me look at the ticket
[16:40:37] <mutante>	 subbu: let me TLDR it for you
[16:40:45] <subbu>	 k :)
[16:40:46] <mutante>	 we now have https://releases.wikimedia.org/parsoid/
[16:41:00] <mutante>	 and there is a new group called releasers-parsoid
[16:41:10] <mutante>	 and you are the only member right now
[16:41:29] <mutante>	 scp sometestfile releases1001.eqiad.wmnet:/srv/org/wikimedia/releases/parsoid/ 
[16:41:31] <subbu>	 could you add arlo and scott to that group as well?
[16:42:37] <mutante>	 sure, i would just handle it as access request which takes 3 business days... doesnt need ops meeting though
[16:43:14] <subbu>	 mutante, success https://releases.wikimedia.org/parsoid/
[16:44:42] <logmsgbot>	 !log dcausse@tin Synchronized php-1.31.0-wmf.30/extensions/CirrusSearch/: T192609: Do not propagate Elastica doc modifications out of DataSender (duration: 01m 34s)
[16:44:43] <mutante>	 i dont see a file, is the index cached.. looking on releases1001.eqiad.wmnet
[16:44:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:44:52] <stashbot>	 T192609: Search backend error during sending {numBulk} documents to the {index} index(s) after {tookMs}: {error_message} - https://phabricator.wikimedia.org/T192609
[16:45:30] <mutante>	 subbu: i see them in the file system but my browser keeps showing me empty index.. caching
[16:45:48] <subbu>	 my browser shows them just fine.
[16:45:52] <subbu>	 ctrl+r
[16:46:07] <subbu>	 mutante, so, anything else you want me to test / do?
[16:46:10] <mutante>	 yea, i also tried adding ?foo and stuff
[16:46:13] <mutante>	 which usually works
[16:46:25] <mutante>	 subbu: no, it's success then. you have a place to archive the files, right
[16:46:31] <subbu>	 yup. thanks. :)
[16:46:40] <mutante>	 resolves the tickets except i will add a subtask to add more people and they will have it next week, k
[16:46:47] <mutante>	 welcome
[16:46:52] <subbu>	 updated https://www.mediawiki.org/w/index.php?title=Parsoid%2FReleases&type=revision&diff=2761835&oldid=2744784 as well.
[16:46:58] <mutante>	 nice
[16:47:30] <mutante>	 subbu: there is just one thing to remember. it could happen in the future that the releases server is switched to codfw
[16:47:45] <subbu>	 ok
[16:47:49] <mutante>	 there are 2 backends, 1001 in eqiad and 2001 in codfw
[16:47:57] <mutante>	 and i setup automatic rsync between them
[16:48:04] <mutante>	 so if you upload to 1001 it will sync over 
[16:48:14] <mutante>	 and we have an active-passive setup and serve from only one at a time
[16:48:23] <subbu>	 ok. sounds good.
[16:48:30] <mutante>	 which releases server is the currently "active" one is configured in Hiera in common.yaml
[16:49:18] <mutante>	 if that is changed then the rsync would also automatically change direction   and 2001 would be the source
[16:49:32] <subbu>	 nice.
[16:49:46] <mutante>	 the shell access is based on role, so applies to both.
[16:49:48] <mutante>	 ok, that's it :
[16:49:49] <mutante>	 :)
[16:51:22] <subbu>	 alright! time for lunch then. :)
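
The rsync behaviour mutante describes (one host is configured as "active", and the sync always flows from it to the other backend) can be sketched roughly as below. This is an illustration only: the function `rsync_plan`, the `HOSTS` list, and the hostname `releases2001.codfw.wmnet` are assumptions, and the real logic lives in puppet/Hiera, which is not shown in this log.

```python
# Hypothetical host list; only releases1001.eqiad.wmnet appears verbatim in the log.
HOSTS = ["releases1001.eqiad.wmnet", "releases2001.codfw.wmnet"]


def rsync_plan(active_host, hosts):
    """Return (source, destinations): data flows from the active host to the rest."""
    if active_host not in hosts:
        raise ValueError(f"unknown active host: {active_host}")
    destinations = [h for h in hosts if h != active_host]
    return active_host, destinations


# With 1001 active, uploads made there are synced over to 2001.
source, destinations = rsync_plan(HOSTS[0], HOSTS)
```

Switching the active host to the second entry reverses the direction without any other change, which matches the "rsync would also automatically change direction" remark.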
[16:52:31] <wikibugs>	 (03PS1) 10Andrew Bogott: Rename 'm5' section to 'wikitech' and add explicit hostnames. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427930 (https://phabricator.wikimedia.org/T189542)
[16:53:13] <wikibugs>	 (03CR) 10Andrew Bogott: "I believe the latest patch implements most of your suggestions.  I don't know how to mark a db server as read-only.  Also, labtestwikitech" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427930 (https://phabricator.wikimedia.org/T189542) (owner: 10Andrew Bogott)
[16:53:49] <thcipriani>	 dcausse: thanks for the quick patch and the deploy! watching https://logstash.wikimedia.org/goto/042b2b3677fa27897418b10ffa49a989 I haven't seen any new instances of the error. Everything look good on your side?
[16:54:19] <dcausse>	 thcipriani: yw, me neither, I'll wait a bit more but everything looks good to me
[16:55:01] <thcipriani>	 awesome :)
[17:03:33] <wikibugs>	 10Operations, 10Parsoid, 10Patch-For-Review, 10Release-Engineering-Team (Watching / External): Provide an archive endpoint for older Parsoid debs (on releases.wikimedia.org or elsewhere) - https://phabricator.wikimedia.org/T150672#4146464 (10Dzahn) 05Open>03Resolved 12:43 < subbu> mutante, success http...
[17:04:49] <andrewbogott>	 !log rebooting labvirt1021 and 1022
[17:04:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:12:03] <wikibugs>	 (03PS3) 10Herron: WIP: puppetmaster: remove support for puppetdb 2.x [puppet] - 10https://gerrit.wikimedia.org/r/427928 (https://phabricator.wikimedia.org/T190318)
[17:12:57] <wikibugs>	 (03Abandoned) 10Chad: scap clean: Use --delete-excluded [mediawiki-config] - 10https://gerrit.wikimedia.org/r/424645 (https://phabricator.wikimedia.org/T157030) (owner: 10Chad)
[17:14:25] <wikibugs>	 (03CR) 10Herron: "https://puppet-compiler.wmflabs.org/compiler02/10993/" [puppet] - 10https://gerrit.wikimedia.org/r/427928 (https://phabricator.wikimedia.org/T190318) (owner: 10Herron)
[17:14:40] <wikibugs>	 (03PS4) 10Herron: puppetmaster: remove support for puppetdb 2.x [puppet] - 10https://gerrit.wikimedia.org/r/427928 (https://phabricator.wikimedia.org/T190318)
[17:16:24] <icinga-wm>	 ACKNOWLEDGEMENT - Apache HTTP on mw1227 is CRITICAL: CRITICAL - Socket timeout after 10 seconds daniel_zahn per SAL: depooled for investigation of high load
[17:16:24] <icinga-wm>	 ACKNOWLEDGEMENT - HHVM rendering on mw1227 is CRITICAL: CRITICAL - Socket timeout after 10 seconds daniel_zahn per SAL: depooled for investigation of high load
[17:16:24] <icinga-wm>	 ACKNOWLEDGEMENT - Nginx local proxy to apache on mw1227 is CRITICAL: CRITICAL - Socket timeout after 10 seconds daniel_zahn per SAL: depooled for investigation of high load
[17:16:40] <icinga-wm>	 ACKNOWLEDGEMENT - Apache HTTP on mw1227 is CRITICAL: CRITICAL - Socket timeout after 10 seconds daniel_zahn per SAL: depooled for investigation of high load
[17:16:40] <icinga-wm>	 ACKNOWLEDGEMENT - HHVM rendering on mw1227 is CRITICAL: CRITICAL - Socket timeout after 10 seconds daniel_zahn per SAL: depooled for investigation of high load
[17:16:40] <icinga-wm>	 ACKNOWLEDGEMENT - Nginx local proxy to apache on mw1227 is CRITICAL: CRITICAL - Socket timeout after 10 seconds daniel_zahn per SAL: depooled for investigation of high load
[17:17:55] <mutante>	 !log phab2001 - upgrading apache, openssl, mysql-common
[17:18:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:18:05] <wikibugs>	 10Operations, 10Ops-Access-Requests: Requesting access to analytics servers for mepps - https://phabricator.wikimedia.org/T192472#4146497 (10mepps) @Dzahn my wikitech username is MEpps and here's the public key:  ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAACAQDMquf0ywthSAksqXIMATkeQt8ui6B2JxWES4zEMQVYtPlVUNnFGQyAbYN/Fe...
[17:19:26] <mutante>	 twentyafterfour: i'm installing apache updates on phab2001 .. and then also openssl and mysql-common. no issues so far.  would also like to hit phab1001 at some point
[17:26:01] <mutante>	 !log phabricator (phab1001) - upgrading Apache, openssl, mysql-common
[17:26:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:28:50] <mutante>	 !log phabricator - restarted apache
[17:28:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:32:40] <icinga-wm>	 PROBLEM - puppet last run on labvirt1022 is CRITICAL: Return code of 255 is out of bounds
[17:35:36] <no_justification>	 !log gerrit: update mysql-client and deps 5.5.59 -> 5.5.60
[17:35:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:36:00] <icinga-wm>	 PROBLEM - configured eth on labvirt1021 is CRITICAL: Return code of 255 is out of bounds
[17:37:41] <icinga-wm>	 PROBLEM - dhclient process on labvirt1021 is CRITICAL: Return code of 255 is out of bounds
[17:38:29] <mutante>	 ^ https://phabricator.wikimedia.org/T183937   
[17:38:32] <icinga-wm>	 PROBLEM - ensure kvm processes are running on labvirt1021 is CRITICAL: Return code of 255 is out of bounds
[17:39:07] <mutante>	 these are new but " labvirt1021 has puppet signed but wont run"
[17:41:01] <andrewbogott>	 I'm setting them up now, they apparently need to make some noise as they come up
[17:41:04] <andrewbogott>	 I downtimed both
[17:41:32] <mutante>	 ah :) cool
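
For context on the "Return code of 255 is out of bounds" alerts above: Nagios-style plugins are expected to exit with 0 to 3, and anything else cannot be mapped to a state. A rough sketch of that mapping follows; `interpret_exit_code` is a hypothetical helper, and reporting the out-of-bounds case as CRITICAL simply mirrors how these alerts appear in this log rather than any particular Icinga source code.

```python
# Standard Nagios plugin exit codes.
STATES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}


def interpret_exit_code(code):
    """Map a plugin exit code to a status string."""
    if code in STATES:
        return STATES[code]
    # Outside 0..3: shown as CRITICAL in the alerts in this log.
    return f"CRITICAL: Return code of {code} is out of bounds"
```

On hosts that are still being set up, as andrewbogott notes, a check that cannot run at all is a plausible source of such an exit code.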
[17:41:57] <logmsgbot>	 !log imarlier@tin Started deploy [performance/coal@99db58f]: coal - update to submit via graphite.  Not yet active, requires puppet changes
[17:42:01] <logmsgbot>	 !log imarlier@tin Finished deploy [performance/coal@99db58f]: coal - update to submit via graphite.  Not yet active, requires puppet changes (duration: 00m 04s)
[17:42:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:42:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:42:41] <icinga-wm>	 PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1943 bytes in 0.103 second response time
[17:45:42] <wikibugs>	 (03CR) 10Thcipriani: [C: 031] mwdeploy: Ensure home directory exists on all machines [puppet] - 10https://gerrit.wikimedia.org/r/427188 (owner: 10Chad)
[17:47:41] <icinga-wm>	 RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1955 bytes in 0.101 second response time
[17:47:51] <icinga-wm>	 RECOVERY - dhclient process on labvirt1021 is OK: PROCS OK: 0 processes with command name dhclient
[17:48:01] <icinga-wm>	 RECOVERY - configured eth on labvirt1021 is OK: OK - interfaces up
[17:52:40] <icinga-wm>	 RECOVERY - puppet last run on labvirt1022 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures
[17:54:04] <wikibugs>	 (03PS1) 10Urbanecm: Add logos for gorwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427940 (https://phabricator.wikimedia.org/T192669)
[17:56:38] <wikibugs>	 (03PS2) 10Urbanecm: Add logos for gorwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427940 (https://phabricator.wikimedia.org/T192669)
[17:58:00] <wikibugs>	 (03PS1) 10Urbanecm: Change timezone for napwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427941 (https://phabricator.wikimedia.org/T192568)
[18:01:56] <wikibugs>	 10Operations, 10cloud-services-team (Kanban): rack/setup/install labvirt102[12] - https://phabricator.wikimedia.org/T183937#4146628 (10Andrew) 05Open>03Resolved Both these systems are now puppetized and ready for testing.
[18:04:40] <icinga-wm>	 RECOVERY - ensure kvm processes are running on labvirt1021 is OK: PROCS OK: 1 process with regex args /usr/bin/kvm
[18:05:52] <wikibugs>	 (03PS1) 10Catrope: Enable internationalized maps on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427943
[18:08:19] <icinga-wm>	 ACKNOWLEDGEMENT - configured eth on labvirt1021 is CRITICAL: eth3 reporting no carrier. andrew bogott T192682 Why is this even
[18:08:19] <icinga-wm>	 ACKNOWLEDGEMENT - configured eth on labvirt1022 is CRITICAL: eth3 reporting no carrier. andrew bogott T192682 Why is this even
[18:11:43] <wikibugs>	 (03PS1) 10Dzahn: admins: create shell account for mepps, add to analytics-privatedata [puppet] - 10https://gerrit.wikimedia.org/r/427944 (https://phabricator.wikimedia.org/T192472)
[18:12:43] <wikibugs>	 (03PS2) 10Dzahn: admins: create shell account for mepps, add to analytics-privatedata [puppet] - 10https://gerrit.wikimedia.org/r/427944 (https://phabricator.wikimedia.org/T192472)
[18:12:52] <wikibugs>	 (03PS3) 10Dzahn: admins: create shell account for mepps, add to analytics-privatedata [puppet] - 10https://gerrit.wikimedia.org/r/427944 (https://phabricator.wikimedia.org/T192472)
[18:16:50] <wikibugs>	 10Operations, 10Ops-Access-Requests, 10Patch-For-Review: Requesting access to analytics servers for mepps - https://phabricator.wikimedia.org/T192472#4146686 (10Dzahn) @mepps Thanks! looks good. I made the needed puppet code change and uploaded to Gerrit. The next step will be getting this reviewed/merged (i...
[18:17:21] <wikibugs>	 (03PS1) 10Imarlier: graphite: add a specific retention rule for coal metrics [puppet] - 10https://gerrit.wikimedia.org/r/427945 (https://phabricator.wikimedia.org/T191994)
[18:20:34] <wikibugs>	 10Operations, 10LDAP-Access-Requests, 10WMF-NDA-Requests: Add Tim_WMDE to the ldap/wmde group - https://phabricator.wikimedia.org/T192256#4146697 (10Dzahn) a:03Dzahn
[18:23:07] <mutante>	 !log add LDAP user "tieu" to group "wmde" (T192256)
[18:23:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:23:14] <stashbot>	 T192256: Add Tim_WMDE to the ldap/wmde group - https://phabricator.wikimedia.org/T192256
[18:23:56] <wikibugs>	 10Operations, 10LDAP-Access-Requests, 10WMF-NDA-Requests: Add Tim_WMDE to the ldap/wmde group - https://phabricator.wikimedia.org/T192256#4146708 (10Dzahn)
[18:24:42] <wikibugs>	 10Operations, 10LDAP-Access-Requests, 10WMF-NDA-Requests: Add Tim_WMDE to the ldap/wmde group - https://phabricator.wikimedia.org/T192256#4132854 (10Dzahn) 05Open>03Resolved @Tim_WMDE You have been added to the group. tieu is a member of wmde. It should work now.
[18:27:25] <wikibugs>	 (03CR) 10Dzahn: [C: 032] "per "attempting to aggregate will corrupt data" and "PITA having to go through the puppet merge process" - just doing it" [puppet] - 10https://gerrit.wikimedia.org/r/427945 (https://phabricator.wikimedia.org/T191994) (owner: 10Imarlier)
[18:29:19] <wikibugs>	 (03CR) 10Dzahn: [C: 04-2] "still want to use "nihonium" but needs to wait until the right, newer, WMF asset number is assigned" [dns] - 10https://gerrit.wikimedia.org/r/426295 (https://phabricator.wikimedia.org/T192092) (owner: 10Dzahn)
[18:39:08] <marlier>	 mutante: thanks for the quick merge on that graphite config change.
[18:46:36] <wikibugs>	 10Operations, 10Ops-Access-Requests, 10Patch-For-Review: Requesting access to analytics servers for mepps - https://phabricator.wikimedia.org/T192472#4146737 (10mepps) Thanks @Dzahn!
[18:49:36] <mutante>	 marlier: welcome:) i saw the PITA comment
[18:49:47] <mutante>	 i ran puppet on graphite1001 but that was it
[18:50:51] <wikibugs>	 10Operations, 10Ops-Access-Requests, 10Patch-For-Review: Requesting access to analytics servers for mepps - https://phabricator.wikimedia.org/T192472#4146745 (10Dzahn) a:03Dzahn
[18:51:06] <marlier>	 I saw that -- it all looks good.
[18:51:24] <mutante>	 great!
[19:19:22] <wikibugs>	 10Operations, 10Ops-Access-Requests: add arlo and scott to parsoid releasers admin group - https://phabricator.wikimedia.org/T192684#4146838 (10Dzahn) p:05Triage>03Normal
[19:23:01] <wikibugs>	 (03PS1) 10Dzahn: admins: add arlolra, cscott to releasers-parsoid [puppet] - 10https://gerrit.wikimedia.org/r/427954 (https://phabricator.wikimedia.org/T192684)
[19:32:31] <wikibugs>	 (03PS1) 10Urbanecm: Temp rate limit for arwiki due to mass vandalism [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427956 (https://phabricator.wikimedia.org/T192668)
[19:33:25] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Temp rate limit for arwiki due to mass vandalism [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427956 (https://phabricator.wikimedia.org/T192668) (owner: 10Urbanecm)
[20:10:18] <wikibugs>	 (03CR) 10Kaldari: "@Ladsgroup: Why does this need a cronjob? Isn't it just a 1-time script?" [puppet] - 10https://gerrit.wikimedia.org/r/424300 (https://phabricator.wikimedia.org/T189596) (owner: 10Ladsgroup)
[20:26:37] <wikibugs>	 (03PS1) 10Andrew Bogott: labvirt1021 and 1022: remove special eth interface juggling [puppet] - 10https://gerrit.wikimedia.org/r/428006 (https://phabricator.wikimedia.org/T192682)
[20:29:13] <wikibugs>	 10Operations, 10Discovery, 10Wikidata, 10Wikidata-Query-Service, and 2 others: reimage wdqs1003 / wdqs200[123] with RAID - https://phabricator.wikimedia.org/T189192#4146959 (10Smalyshev) Great, thanks!
[20:40:41] <wikibugs>	 10Operations, 10Beta-Cluster-Infrastructure, 10Release-Engineering-Team: Beta Cluster sends password reset mails with prod address - https://phabricator.wikimedia.org/T192686#4146985 (10Krinkle)
[20:40:51] <wikibugs>	 10Operations, 10Beta-Cluster-Infrastructure, 10Release-Engineering-Team: Beta Cluster sends password reset mails with prod address - https://phabricator.wikimedia.org/T192686#4146974 (10Krinkle)
[20:41:46] <wikibugs>	 (03PS2) 10Andrew Bogott: labvirt1021 and 1022: remove special eth interface juggling [puppet] - 10https://gerrit.wikimedia.org/r/428006 (https://phabricator.wikimedia.org/T192682)
[20:41:48] <wikibugs>	 (03PS1) 10Andrew Bogott: labvirt1021 and 1022: Move back to Jessie [puppet] - 10https://gerrit.wikimedia.org/r/428007
[20:43:14] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 032] labvirt1021 and 1022: remove special eth interface juggling [puppet] - 10https://gerrit.wikimedia.org/r/428006 (https://phabricator.wikimedia.org/T192682) (owner: 10Andrew Bogott)
[20:43:20] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 032] labvirt1021 and 1022: Move back to Jessie [puppet] - 10https://gerrit.wikimedia.org/r/428007 (owner: 10Andrew Bogott)
[20:45:58] <andrewbogott>	 !log re-imaging labvirt1021 and 1022 as Jessie
[20:46:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:56:02] <wikibugs>	 (03CR) 10Rxy: Temp rate limit for arwiki due to mass vandalism (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/427956 (https://phabricator.wikimedia.org/T192668) (owner: 10Urbanecm)
[21:00:06] <wikibugs>	 (03PS1) 10Thcipriani: Pipeline: setup minikube in CI [puppet] - 10https://gerrit.wikimedia.org/r/428010 (https://phabricator.wikimedia.org/T188936)
[21:00:48] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Pipeline: setup minikube in CI [puppet] - 10https://gerrit.wikimedia.org/r/428010 (https://phabricator.wikimedia.org/T188936) (owner: 10Thcipriani)
[21:01:21] <wikibugs>	 (03PS2) 10Thcipriani: Pipeline: setup minikube in CI [puppet] - 10https://gerrit.wikimedia.org/r/428010 (https://phabricator.wikimedia.org/T188936)
[21:10:45] <wikibugs>	 (03PS1) 10Andrew Bogott: labvirt1021 and 1022:  Move to Jessie, second attempt [puppet] - 10https://gerrit.wikimedia.org/r/428019
[21:12:27] <wikibugs>	 (03PS2) 10Andrew Bogott: labvirt1021 and 1022:  Move to Jessie, second attempt [puppet] - 10https://gerrit.wikimedia.org/r/428019
[21:13:40] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 032] labvirt1021 and 1022:  Move to Jessie, second attempt [puppet] - 10https://gerrit.wikimedia.org/r/428019 (owner: 10Andrew Bogott)
[21:18:15] <wikibugs>	 10Operations, 10Scap, 10Release-Engineering-Team (Kanban): mwscript rebuildLocalisationCache.php takes 40 minutes - https://phabricator.wikimedia.org/T191921#4147059 (10Krinkle) @thcipriani Is this with or without translation cache (TC) and JIT?
[21:27:45] <wikibugs>	 10Operations, 10Scap, 10Release-Engineering-Team (Kanban): mwscript rebuildLocalisationCache.php takes 40 minutes - https://phabricator.wikimedia.org/T191921#4147070 (10thcipriani) This uses the value from `/etc/hhvm/php.ini`: `hhvm.jit = false`.  Played with `-vEval.Jit=1` yesterday and it was quite a bit s...
[21:56:06] <wikibugs>	 (03CR) 10Dzahn: [C: 04-1] "might not be what was intended, needs clarification on ticket" [puppet] - 10https://gerrit.wikimedia.org/r/427833 (https://phabricator.wikimedia.org/T177850) (owner: 10Dzahn)
[21:56:59] <wikibugs>	 (03CR) 10Dzahn: [C: 04-1] "needs some communication effort to tell all the users of this module that there is a change coming" [puppet] - 10https://gerrit.wikimedia.org/r/415510 (owner: 10Dzahn)
[21:58:08] <wikibugs>	 (03PS1) 10Andrew Bogott: openstack: Pin a bunch of packages for mitaka/jessie [puppet] - 10https://gerrit.wikimedia.org/r/428023 (https://phabricator.wikimedia.org/T192162)
[21:58:45] <wikibugs>	 (03CR) 10Dzahn: "wasn't sure where what to compile this on but i see cache::misc has a superset director with thorium.eqiad.wmnet as a single backend" [puppet] - 10https://gerrit.wikimedia.org/r/416742 (owner: 10Dzahn)
[21:59:21] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 032] openstack: Pin a bunch of packages for mitaka/jessie [puppet] - 10https://gerrit.wikimedia.org/r/428023 (https://phabricator.wikimedia.org/T192162) (owner: 10Andrew Bogott)
[22:00:50] <wikibugs>	 10Puppet, 10Beta-Cluster-Infrastructure, 10MW-1.32-release-notes (WMF-deploy-2018-04-24 (1.32.0-wmf.1)), 10Patch-For-Review: deployment-prep has jobqueue issues - https://phabricator.wikimedia.org/T192473#4147110 (10EddieGP)
[22:04:49] <wikibugs>	 10Operations, 10Beta-Cluster-Infrastructure, 10Release-Engineering-Team: Beta Cluster sends password reset mails with prod address - https://phabricator.wikimedia.org/T192686#4146974 (10MarcoAurelio) Maybe we should change it to `wiki@wikimedia.beta.wmflabs.org`. Does that address need to exist?
[22:08:01] <wikibugs>	 10Operations, 10Beta-Cluster-Infrastructure, 10MediaWiki-Configuration, 10Release-Engineering-Team: Beta Cluster sends password reset mails with prod address - https://phabricator.wikimedia.org/T192686#4147125 (10Dzahn)
[22:10:50] <wikibugs>	 10Operations, 10Beta-Cluster-Infrastructure, 10MediaWiki-Configuration, 10Release-Engineering-Team: Beta Cluster sends password reset mails with prod address - https://phabricator.wikimedia.org/T192686#4146974 (10Dzahn) >>! In T192686#4147116, @MarcoAurelio wrote: > Maybe we should change it to `wiki@wikim...
[22:19:46] <wikibugs>	 (03Draft1) 10MarcoAurelio: labs: use a $wgPasswordSender different from the production one [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428026 (https://phabricator.wikimedia.org/T192686)
[22:19:50] <wikibugs>	 (03PS2) 10MarcoAurelio: labs: use a $wgPasswordSender different from the production one [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428026 (https://phabricator.wikimedia.org/T192686)
[22:22:19] <wikibugs>	 (03PS1) 10Andrew Bogott: openstack compute: Fix monitoring of kvm processes on Debian [puppet] - 10https://gerrit.wikimedia.org/r/428027
[22:24:54] <wikibugs>	 (03CR) 10Dzahn: [C: 031] "208.80.155.135" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428026 (https://phabricator.wikimedia.org/T192686) (owner: 10MarcoAurelio)
[22:26:58] <wikibugs>	 10Operations, 10Beta-Cluster-Infrastructure, 10MediaWiki-Configuration, 10Release-Engineering-Team, 10Patch-For-Review: Beta Cluster sends password reset mails with prod address - https://phabricator.wikimedia.org/T192686#4147140 (10MarcoAurelio) With regards to my patch above, we should investigate if t...
[22:29:40] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 032] openstack compute: Fix monitoring of kvm processes on Debian [puppet] - 10https://gerrit.wikimedia.org/r/428027 (owner: 10Andrew Bogott)
[22:55:37] <wikibugs>	 10Operations, 10Scap, 10Release-Engineering-Team (Kanban): mwscript rebuildLocalisationCache.php takes 40 minutes - https://phabricator.wikimedia.org/T191921#4147196 (10Krinkle) @thcipriani Hm.. these are seconds though, as opposed to minutes. Is there something different about these commands? I'm asking bec...
[22:58:07] <wikibugs>	 10Operations, 10Scap, 10Release-Engineering-Team (Kanban): mwscript rebuildLocalisationCache.php takes 40 minutes - https://phabricator.wikimedia.org/T191921#4147197 (10thcipriani) >>! In T191921#4147196, @Krinkle wrote: > @thcipriani Hm.. these are seconds though, as opposed to minutes. Is there something d...
[22:59:13] <wikibugs>	 (03CR) 10Smalyshev: [C: 031] Add cirrussearch settings for wikibase [mediawiki-config] - 10https://gerrit.wikimedia.org/r/419367 (https://phabricator.wikimedia.org/T182717) (owner: 10DCausse)
[22:59:25] <wikibugs>	 (03PS3) 10Krinkle: labs: use a $wgPasswordSender different from the production one [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428026 (https://phabricator.wikimedia.org/T192686) (owner: 10MarcoAurelio)
[22:59:59] <wikibugs>	 (03CR) 10Krinkle: "Moved it nearby the other mail-related setting. There are three in total. The third one is wgEmergencyContact, but it seems that one is no" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428026 (https://phabricator.wikimedia.org/T192686) (owner: 10MarcoAurelio)
[23:00:02] <wikibugs>	 (03CR) 10Krinkle: [C: 031] labs: use a $wgPasswordSender different from the production one [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428026 (https://phabricator.wikimedia.org/T192686) (owner: 10MarcoAurelio)
[23:33:07] <icinga-wm>	 PROBLEM - Disk space on labtestnet2001 is CRITICAL: DISK CRITICAL - free space: / 347 MB (3% inode=75%)
[23:55:40] <wikibugs>	 (03CR) 10Krinkle: [C: 032] labs: use a $wgPasswordSender different from the production one [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428026 (https://phabricator.wikimedia.org/T192686) (owner: 10MarcoAurelio)
[23:56:57] <wikibugs>	 (03Merged) 10jenkins-bot: labs: use a $wgPasswordSender different from the production one [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428026 (https://phabricator.wikimedia.org/T192686) (owner: 10MarcoAurelio)
[23:58:54] <wikibugs>	 (03PS1) 10Bstorm: wiki replicas: Depool labsdb1010 for MCR table additions [puppet] - 10https://gerrit.wikimedia.org/r/428037 (https://phabricator.wikimedia.org/T184446)
[23:59:07] <wikibugs>	 (03CR) 10jenkins-bot: labs: use a $wgPasswordSender different from the production one [mediawiki-config] - 10https://gerrit.wikimedia.org/r/428026 (https://phabricator.wikimedia.org/T192686) (owner: 10MarcoAurelio)
[23:59:59] <p858snake>	 Krinkle: may also want to check [[MediaWiki:Emailsender]] on the relevant wikis as well as that sets the from name for the email