[00:05:38] <icinga-wm>	 RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[00:27:12] <wikibugs>	 (03PS1) 10Dzahn: icinga/check_ssl: add support for stretch, rename it to check_tls [puppet] - 10https://gerrit.wikimedia.org/r/465805 (https://phabricator.wikimedia.org/T202782)
[00:27:58] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] icinga/check_ssl: add support for stretch, rename it to check_tls [puppet] - 10https://gerrit.wikimedia.org/r/465805 (https://phabricator.wikimedia.org/T202782) (owner: 10Dzahn)
[00:30:57] <wikibugs>	 (03PS2) 10Dzahn: icinga/check_ssl: add support for stretch, rename it to check_tls [puppet] - 10https://gerrit.wikimedia.org/r/465805 (https://phabricator.wikimedia.org/T202782)
[00:31:46] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] icinga/check_ssl: add support for stretch, rename it to check_tls [puppet] - 10https://gerrit.wikimedia.org/r/465805 (https://phabricator.wikimedia.org/T202782) (owner: 10Dzahn)
[01:02:48] <icinga-wm>	 RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 30 probes of 320 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[01:05:17] <icinga-wm>	 PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS6939/IPv4: Connect https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[01:09:58] <icinga-wm>	 PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 46 probes of 320 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[01:20:07] <icinga-wm>	 RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 32 probes of 320 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[01:27:18] <icinga-wm>	 PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 87 probes of 320 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[01:28:57] <icinga-wm>	 PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 65 probes of 320 (alerts on 35) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[01:28:57] <icinga-wm>	 PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 52 probes of 319 (alerts on 35) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[01:30:18] <wikibugs>	 (03CR) 10Krinkle: [C: 032] wmfSetupEtcd: Correctly initialize the local cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464117 (https://phabricator.wikimedia.org/T176370) (owner: 10Giuseppe Lavagetto)
[01:32:03] <wikibugs>	 (03Merged) 10jenkins-bot: wmfSetupEtcd: Correctly initialize the local cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464117 (https://phabricator.wikimedia.org/T176370) (owner: 10Giuseppe Lavagetto)
[01:33:58] <icinga-wm>	 RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 16 probes of 319 (alerts on 35) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[01:33:58] <icinga-wm>	 RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 20 probes of 320 (alerts on 35) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[01:38:36] <logmsgbot>	 !log krinkle@deploy1001 Synchronized wmf-config/etcd.php: T176370 - I5e7e5d167d517 (duration: 00m 55s)
[01:38:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:38:39] <stashbot>	 T176370: Migrate to PHP 7 in WMF production - https://phabricator.wikimedia.org/T176370
[01:39:35] <wikibugs>	 (03CR) 10jenkins-bot: wmfSetupEtcd: Correctly initialize the local cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464117 (https://phabricator.wikimedia.org/T176370) (owner: 10Giuseppe Lavagetto)
[01:41:18] <icinga-wm>	 PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 65 probes of 319 (alerts on 35) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[01:51:27] <icinga-wm>	 RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 16 probes of 319 (alerts on 35) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[01:58:47] <icinga-wm>	 PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 101 probes of 319 (alerts on 35) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[02:00:47] <icinga-wm>	 PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 102 probes of 320 (alerts on 35) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[02:07:14] <logmsgbot>	 !log krinkle@deploy1001 Synchronized php-1.32.0-wmf.24/tests/phpunit/includes/page/: Ib211d98498f (duration: 00m 49s)
[02:07:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:10:31] <logmsgbot>	 !log krinkle@deploy1001 Synchronized php-1.32.0-wmf.24/includes/page/WikiPage.php: T203942 - Ib211d98498f (duration: 00m 49s)
[02:10:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:10:34] <stashbot>	 T203942: Fatal exception when attempting to view wiki page that should redirect to Media-namespace ("NS_MEDIA is a virtual namespace; use NS_FILE") - https://phabricator.wikimedia.org/T203942
[02:21:13] <wikibugs>	 (03PS2) 10Krinkle: Enforce no-session constraint in static.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464391 (https://phabricator.wikimedia.org/T127233) (owner: 10Anomie)
[02:21:15] <wikibugs>	 (03CR) 10Krinkle: [C: 032] Enforce no-session constraint in static.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464391 (https://phabricator.wikimedia.org/T127233) (owner: 10Anomie)
[02:22:28] <wikibugs>	 (03Merged) 10jenkins-bot: Enforce no-session constraint in static.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464391 (https://phabricator.wikimedia.org/T127233) (owner: 10Anomie)
[02:25:09] <logmsgbot>	 !log krinkle@deploy1001 Synchronized w/static.php: T127233 - Ic6acb70 (duration: 00m 49s)
[02:25:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:25:21] <stashbot>	 T127233: Endpoints which do not need to authenticate users should set MW_NO_SESSION - https://phabricator.wikimedia.org/T127233
[02:25:58] <icinga-wm>	 RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 20 probes of 320 (alerts on 35) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[02:26:01] <wikibugs>	 (03CR) 10jenkins-bot: Enforce no-session constraint in static.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464391 (https://phabricator.wikimedia.org/T127233) (owner: 10Anomie)
[02:34:08] <icinga-wm>	 RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 16 probes of 319 (alerts on 35) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[02:52:38] <icinga-wm>	 PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5
[02:52:48] <icinga-wm>	 PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[02:55:18] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[02:55:35] <Krinkle>	 Something is up
[02:55:39] <Krinkle>	 6,000 exceptions
[02:55:50] <Krinkle>	 All in the form of: LoadBalancer.php: Transaction spent 10.059820175171 second(s) in writes, exceeding the limit of 3
[02:57:09] <MaxSem>	 Write contention?
[02:57:33] <Krinkle>	 Something in api, with commons/search referers
[03:03:57] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[03:06:36] <Krinkle>	 Looks like a commonswiki bot editing via API and hitting some issue, possibly with editIncrement
[03:10:07] <icinga-wm>	 RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5
[03:10:08] <icinga-wm>	 RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[03:10:48] <MaxSem>	 Bleh, it's not findable in Hadoop's wmf_raw.apiaction.
[03:28:47] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 887.64 seconds
[03:39:17] <icinga-wm>	 PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 39 probes of 320 (alerts on 35) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[03:44:18] <icinga-wm>	 RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 18 probes of 320 (alerts on 35) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[03:49:07] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 260.17 seconds
[04:20:23] <wikibugs>	 (03CR) 10Dzahn: [C: 04-1] "already working on an alternative that just edits check_ssl itself instead" [puppet] - 10https://gerrit.wikimedia.org/r/465805 (https://phabricator.wikimedia.org/T202782) (owner: 10Dzahn)
[05:07:17] <wikibugs>	 10Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1067 - https://phabricator.wikimedia.org/T206500 (10Marostegui) 05Open>03Resolved So after replacing the disk 3 times yesterday evening...we finally got this fixed! Thanks a lot Chris!  ```  Number of Virtual Disks: 1 Virtual Drive: 0 (Target Id: 0) Na...
[05:13:33] <wikibugs>	 (03PS1) 10KartikMistry: lttoolbox: New upstream release [debs/contenttranslation/lttoolbox] - 10https://gerrit.wikimedia.org/r/465932 (https://phabricator.wikimedia.org/T206439)
[05:18:39] <wikibugs>	 (03CR) 10Marostegui: "this is not needed and can be abandoned" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/465634 (owner: 10Banyek)
[05:20:04] <wikibugs>	 (03PS1) 10Marostegui: db-eqiad.php: Temporary depool db1109 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/465933 (https://phabricator.wikimedia.org/T205865)
[05:20:17] <icinga-wm>	 RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 33 probes of 320 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[05:21:47] <wikibugs>	 10Operations, 10DBA, 10Epic, 10Patch-For-Review: DB meta task for next DC failover issues - https://phabricator.wikimedia.org/T189107 (10Marostegui)
[05:21:52] <wikibugs>	 10Operations, 10DBA, 10cloud-services-team, 10wikitech.wikimedia.org, and 2 others: Move some wikis to s5 - https://phabricator.wikimedia.org/T184805 (10Marostegui)
[05:21:55] <wikibugs>	 (03CR) 10Marostegui: [C: 032] db-eqiad.php: Temporary depool db1109 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/465933 (https://phabricator.wikimedia.org/T205865) (owner: 10Marostegui)
[05:22:59] <wikibugs>	 10Operations, 10Goal, 10Patch-For-Review: Perform a datacenter switchover (2018-19 Q1) - https://phabricator.wikimedia.org/T199073 (10Marostegui)
[05:23:03] <wikibugs>	 10Operations, 10DBA, 10Epic, 10Patch-For-Review: DB meta task for next DC failover issues - https://phabricator.wikimedia.org/T189107 (10Marostegui) 05Open>03Resolved a:03Marostegui All the tasks we scheduled to do whilst eqiad was passive, were done!. We also included T184805 on a last minute task,...
[05:23:09] <wikibugs>	 (03Merged) 10jenkins-bot: db-eqiad.php: Temporary depool db1109 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/465933 (https://phabricator.wikimedia.org/T205865) (owner: 10Marostegui)
[05:23:43] <cscott>	 wikibase appears to be failing its jenkins checkstyle jobs
[05:23:44] <wikibugs>	 (03PS1) 10Marostegui: Revert "db-eqiad.php: Temporary depool db1109" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/465935
[05:24:10] <wikibugs>	 (03PS5) 10Marostegui: db: Switch dns master alias to eqiad [dns] - 10https://gerrit.wikimedia.org/r/458790 (https://phabricator.wikimedia.org/T203777) (owner: 10Alexandros Kosiaris)
[05:24:22] <logmsgbot>	 !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1109 (duration: 00m 51s)
[05:24:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:24:40] <cscott>	 03:14:59 Build step 'Execute shell' marked build as failure
[05:24:40] <cscott>	 03:14:59 [CHECKSTYLE] Collecting checkstyle analysis files...
[05:24:40] <cscott>	 03:14:59 [CHECKSTYLE] Searching for all files in /srv/jenkins-workspace/workspace/mwext-php70-phan-docker that match the pattern log/phan-issues
[05:24:40] <cscott>	 03:14:59 [CHECKSTYLE] No files found. Configuration error?
[05:25:00] <cscott>	 eg https://integration.wikimedia.org/ci/job/mwext-php70-phan-docker/14812/console
[05:25:26] <cscott>	 not sure what's going on there, there doesn't seem to be any recent change to either wikibase or integration-config that obviously explains this
[05:26:01] <marostegui>	 cscott: Maybe create a task for releng?
[05:26:40] <cscott>	 marostegui: sure, just figured i'd mention it here first in case someone was like, oh, i just totally did XYZ that would explain that
[05:27:07] <marostegui>	 cscott: Yeah, from my side, I am not aware of changes there, but it doesn't mean there were not changes of course :-)
[05:27:12] <wikibugs>	 (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Temporary depool db1109" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/465935 (owner: 10Marostegui)
[05:27:27] <icinga-wm>	 PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 43 probes of 320 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[05:28:11] <wikibugs>	 10Operations, 10DBA, 10MediaWiki-General-or-Unknown, 10Wikidata, and 4 others: Investigate decrease in wikidata dispatch times due to eqiad -> codfw DC switch - https://phabricator.wikimedia.org/T205865 (10Marostegui) I have run an `analyze table` on both db1109 and db2083 and things are similar now:  ```...
[05:28:25] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Temporary depool db1109" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/465935 (owner: 10Marostegui)
[05:29:31] <wikibugs>	 (03CR) 10Marostegui: [C: 032] db: Switch dns master alias to eqiad [dns] - 10https://gerrit.wikimedia.org/r/458790 (https://phabricator.wikimedia.org/T203777) (owner: 10Alexandros Kosiaris)
[05:29:33] <logmsgbot>	 !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1109 (duration: 00m 48s)
[05:29:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:31:09] <cscott>	 marostegui: filed T206738, although i'm not certain i used the right tags
[05:31:10] <stashbot>	 T206738: Wikibase appears to be failing checkstyle on all builds now - https://phabricator.wikimedia.org/T206738
[05:32:06] <wikibugs>	 (03CR) 10jenkins-bot: db-eqiad.php: Temporary depool db1109 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/465933 (https://phabricator.wikimedia.org/T205865) (owner: 10Marostegui)
[05:32:08] <wikibugs>	 (03CR) 10jenkins-bot: Revert "db-eqiad.php: Temporary depool db1109" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/465935 (owner: 10Marostegui)
[05:32:15] <marostegui>	 cscott: I changed one, but there are so many tags that it is not easy to find the right one :)
[05:43:30] <marostegui>	 !log Purge binary logs on pc2005 due to disk space issues - T206740
[05:43:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:43:33] <stashbot>	 T206740: parsercache used disk space increase - https://phabricator.wikimedia.org/T206740
[05:46:05] <wikibugs>	 (03PS2) 10Muehlenhoff: Remove now obsolete Diamond collector [puppet] - 10https://gerrit.wikimedia.org/r/465596 (https://phabricator.wikimedia.org/T183454)
[05:47:28] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 032] Remove now obsolete Diamond collector [puppet] - 10https://gerrit.wikimedia.org/r/465596 (https://phabricator.wikimedia.org/T183454) (owner: 10Muehlenhoff)
[05:53:48] <wikibugs>	 (03PS1) 10Marostegui: db-eqiad.php: Increase weight for db1092 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/465959 (https://phabricator.wikimedia.org/T205514)
[05:58:34] <wikibugs>	 (03CR) 10Marostegui: [C: 032] db-eqiad.php: Increase weight for db1092 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/465959 (https://phabricator.wikimedia.org/T205514) (owner: 10Marostegui)
[05:59:49] <wikibugs>	 (03Merged) 10jenkins-bot: db-eqiad.php: Increase weight for db1092 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/465959 (https://phabricator.wikimedia.org/T205514) (owner: 10Marostegui)
[06:00:53] <logmsgbot>	 !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Increase db1092 weight (duration: 00m 49s)
[06:00:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:04:32] <wikibugs>	 (03CR) 10jenkins-bot: db-eqiad.php: Increase weight for db1092 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/465959 (https://phabricator.wikimedia.org/T205514) (owner: 10Marostegui)
[06:07:56] <wikibugs>	 (03PS4) 10Muehlenhoff: Stop the diamond service when removing Diamond [puppet] - 10https://gerrit.wikimedia.org/r/465440 (https://phabricator.wikimedia.org/T183454)
[06:09:26] <wikibugs>	 (03PS4) 10Urbanecm: Initial configuration for yuewiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463482 (https://phabricator.wikimedia.org/T205546)
[06:10:21] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 032] Stop the diamond service when removing Diamond [puppet] - 10https://gerrit.wikimedia.org/r/465440 (https://phabricator.wikimedia.org/T183454) (owner: 10Muehlenhoff)
[06:28:08] <icinga-wm>	 RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 30 probes of 320 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[06:29:38] <icinga-wm>	 PROBLEM - puppet last run on mw1323 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/var/lib/hphpd/hphpd.ini]
[06:33:33] <wikibugs>	 (03PS2) 10Muehlenhoff: Remove all absented Diamond collectors [puppet] - 10https://gerrit.wikimedia.org/r/465600 (https://phabricator.wikimedia.org/T183454)
[06:35:27] <icinga-wm>	 PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 63 probes of 320 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[06:39:47] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 032] Remove all absented Diamond collectors [puppet] - 10https://gerrit.wikimedia.org/r/465600 (https://phabricator.wikimedia.org/T183454) (owner: 10Muehlenhoff)
[06:55:37] <icinga-wm>	 RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 32 probes of 320 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[06:56:00] <wikibugs>	 (03PS2) 10Muehlenhoff: Remove obsolete Diamond collectors [puppet] - 10https://gerrit.wikimedia.org/r/465601 (https://phabricator.wikimedia.org/T183454)
[06:57:49] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 032] Remove obsolete Diamond collectors [puppet] - 10https://gerrit.wikimedia.org/r/465601 (https://phabricator.wikimedia.org/T183454) (owner: 10Muehlenhoff)
[06:59:08] <wikibugs>	 (03PS2) 10Muehlenhoff: Enable base::service_auto_restart for SSH [puppet] - 10https://gerrit.wikimedia.org/r/444230 (https://phabricator.wikimedia.org/T135991)
[06:59:58] <icinga-wm>	 RECOVERY - puppet last run on mw1323 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[07:02:57] <icinga-wm>	 PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 62 probes of 320 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[07:06:27] <wikibugs>	 (03PS1) 10Muehlenhoff: Disable Diamond on test role [puppet] - 10https://gerrit.wikimedia.org/r/466007 (https://phabricator.wikimedia.org/T183454)
[07:09:51] <wikibugs>	 (03PS1) 10Joal: Bump AQS druid-datasource to 2018_09 [puppet] - 10https://gerrit.wikimedia.org/r/466018
[07:11:08] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 032] Disable Diamond on test role [puppet] - 10https://gerrit.wikimedia.org/r/466007 (https://phabricator.wikimedia.org/T183454) (owner: 10Muehlenhoff)
[07:12:57] <wikibugs>	 (03PS2) 10Elukey: Bump AQS druid-datasource to 2018_09 [puppet] - 10https://gerrit.wikimedia.org/r/466018 (owner: 10Joal)
[07:13:48] <wikibugs>	 (03CR) 10Elukey: [C: 032] Bump AQS druid-datasource to 2018_09 [puppet] - 10https://gerrit.wikimedia.org/r/466018 (owner: 10Joal)
[07:17:00] <wikibugs>	 (03PS3) 10Mathew.onipe: base::monitoring::host: added prometheus check for network drops [puppet] - 10https://gerrit.wikimedia.org/r/465450 (https://phabricator.wikimedia.org/T206114)
[07:19:16] <wikibugs>	 (03CR) 10Mathew.onipe: base::monitoring::host: added prometheus check for network drops (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/465450 (https://phabricator.wikimedia.org/T206114) (owner: 10Mathew.onipe)
[07:26:13] <addshore>	 \o all
[07:26:16] <addshore>	 jouncebot: now
[07:26:16] <jouncebot>	 No deployments scheduled for the next 3 hour(s) and 33 minute(s)
[07:26:36] * addshore is going to sync a mw change adding some more tracking to statsd for wikidata dispatching
[07:27:49] <wikibugs>	 (03PS1) 10Elukey: Add stat1007 as role::spare::system [puppet] - 10https://gerrit.wikimedia.org/r/466030 (https://phabricator.wikimedia.org/T203852)
[07:28:45] <wikibugs>	 (03CR) 10Elukey: [C: 032] Add stat1007 as role::spare::system [puppet] - 10https://gerrit.wikimedia.org/r/466030 (https://phabricator.wikimedia.org/T203852) (owner: 10Elukey)
[07:36:29] <elukey>	 !log roll restart of aqs on aqs100[4-9] to pick up new Druid settings
[07:36:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:38:08] <icinga-wm>	 RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 29 probes of 320 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[07:42:55] <wikibugs>	 (03CR) 10Gehel: base::monitoring::host: added prometheus check for network drops (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/465450 (https://phabricator.wikimedia.org/T206114) (owner: 10Mathew.onipe)
[07:43:40] <addshore>	 !log deploy https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/Wikibase/+/466031 to mwmaint1002 only (increasing tracking of wikidata dispatching) T205865
[07:43:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:43:43] <stashbot>	 T205865: Investigate decrease in wikidata dispatch times due to eqiad -> codfw DC switch - https://phabricator.wikimedia.org/T205865
[07:45:12] <wikibugs>	 (03PS1) 10Jcrespo: parsercache: Reduce retention time to 7 days due to running out of space [puppet] - 10https://gerrit.wikimedia.org/r/466036
[07:45:27] <icinga-wm>	 PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 48 probes of 320 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[07:47:15] <wikibugs>	 (03CR) 10Gilles: [C: 031] Remove obsolete mediawiki-firejail-rsvg-convert [puppet] - 10https://gerrit.wikimedia.org/r/465590 (owner: 10Muehlenhoff)
[07:47:21] <wikibugs>	 (03CR) 10Marostegui: [C: 031] parsercache: Reduce retention time to 7 days due to running out of space [puppet] - 10https://gerrit.wikimedia.org/r/466036 (owner: 10Jcrespo)
[07:50:10] <wikibugs>	 (03PS2) 10Gehel: wdqs: increase throttling limits for internal cluster [puppet] - 10https://gerrit.wikimedia.org/r/465652 (https://phabricator.wikimedia.org/T206648)
[07:50:37] <wikibugs>	 10Operations, 10DBA, 10MediaWiki-Database: parsercache used disk space increase - https://phabricator.wikimedia.org/T206740 (10Marostegui)
[07:51:37] <icinga-wm>	 PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on einsteinium is CRITICAL: cluster=cache_text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[07:52:36] <wikibugs>	 (03PS3) 10Gehel: wdqs: increase throttling limits for internal cluster [puppet] - 10https://gerrit.wikimedia.org/r/465652 (https://phabricator.wikimedia.org/T206648)
[07:53:36] <wikibugs>	 (03CR) 10Gehel: [C: 032] wdqs: increase throttling limits for internal cluster [puppet] - 10https://gerrit.wikimedia.org/r/465652 (https://phabricator.wikimedia.org/T206648) (owner: 10Gehel)
[07:56:34] <wikibugs>	 10Operations, 10DBA, 10MediaWiki-General-or-Unknown, 10Wikidata, and 4 others: Investigate decrease in wikidata dispatch times due to eqiad -> codfw DC switch - https://phabricator.wikimedia.org/T205865 (10Addshore) Looking at a couple of minutes of extra timing data it looks like this is down to the selec...
[07:57:31] <gehel>	 !log rolling restart blazegraph on wdqs-internal for config change  - T206648
[07:57:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:57:34] <stashbot>	 T206648: Increase throttling rates for wdqs internal cluster - https://phabricator.wikimedia.org/T206648
[07:57:57] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[07:59:07] <icinga-wm>	 RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[07:59:15] <wikibugs>	 (03CR) 10Gehel: [C: 032] wdqs: refactor use_git_deploy to include scap3 and autodeploy options [puppet] - 10https://gerrit.wikimedia.org/r/465566 (https://phabricator.wikimedia.org/T206597) (owner: 10Mathew.onipe)
[07:59:23] <wikibugs>	 (03PS4) 10Gehel: wdqs: refactor use_git_deploy to include scap3 and autodeploy options [puppet] - 10https://gerrit.wikimedia.org/r/465566 (https://phabricator.wikimedia.org/T206597) (owner: 10Mathew.onipe)
[08:00:07] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[08:00:28] <icinga-wm>	 RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 31 probes of 320 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[08:04:04] <banyek>	 !log purging binary logs on pc1004
[08:04:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:04:06] <jynus>	 !log running /usr/local/bin/mwscript purgeParserCache.php --wiki=aawiki --age=1900800 --msleep 0
[08:04:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:04:08] <_joe_>	 we had a small mediawiki outage there ^^
[08:04:29] <banyek>	 !log purging binary logs on pc1005
[08:04:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:04:54] <banyek>	 !log purging binary logs on pc1006
[08:04:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:06:18] <wikibugs>	 10Operations, 10ops-eqiad, 10Analytics, 10Patch-For-Review, 10User-Elukey: rack/setup/install stat1007.eqiad.wmnet (stat1005 user replacement) - https://phabricator.wikimedia.org/T203852 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on neodymium.eqiad.wmnet for hosts: ``` stat1007...
[08:06:32] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 04-1] "Well application/x-www-form-urlencoded is the default for POST requests in almost every HTTP client." (032 comments) [software/service-checker] - 10https://gerrit.wikimedia.org/r/461457 (owner: 10Dduvall)
[08:07:13] <addshore>	 _joe_: what was it?
[08:07:22] <_joe_>	 addshore: still no idea
[08:07:28] <addshore>	 *looks*
[08:07:43] <_joe_>	 we had some more during the night, too
[08:07:47] <icinga-wm>	 PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 73 probes of 320 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[08:07:53] <_joe_>	 not huge total outages, but still
[08:08:29] <addshore>	 [{exception_id}] {exception_url} Wikimedia\Rdbms\DBQueryError from line 1484 of /srv/mediawiki/php-1.32.0-wmf.24/includes/libs/rdbms/database/Database.php: A database query error has occurred. Did you forget to run your application's database schema upd....
[08:08:43] <addshore>	 interesting....
[08:09:28] <addshore>	 looks like commons wiki transactions taking too long and throwing
[08:09:35] <_joe_>	 yes
[08:09:44] <_joe_>	 jynus, marostegui, banyek ^^ 
[08:09:51] <marostegui>	 addshore: can you give me a hostname?
[08:09:52] <_joe_>	 it was a short burst though
[08:10:05] <_joe_>	 a client marostegui?
[08:10:08] <_joe_>	 or the db server?
[08:10:14] <addshore>	 this is an example doc in logstash https://logstash.wikimedia.org/app/kibana#/doc/logstash-*/logstash-2018.10.11/mediawiki/?id=AWZiHl3Ez9bxnQJdGV4D
[08:10:15] <marostegui>	 the db server 
[08:10:23] <marostegui>	 addshore: thanks I will take it from there
[08:10:45] <addshore>	 It doesn't look like the db server is logged in that log line though
[08:10:52] <jynus>	 that is a known and reported mw issue
[08:11:09] <jynus>	 https://phabricator.wikimedia.org/T202715
[08:11:11] <_joe_>	 yeah it's not logged
[08:11:29] <marostegui>	 Yeah,  but the error is enough
[08:11:34] <marostegui>	 As jynus said, it is not "new"
[08:11:38] <_joe_>	 oh my, a counter on sql
[08:11:44] * _joe_ feels sad
[08:12:08] <_joe_>	 but it's per user, so I'm less sad
[08:12:42] <_joe_>	 still it means every edit makes the cache on that table stale, right?
[08:13:29] <marostegui>	 https://phabricator.wikimedia.org/T202715#4634924
[08:20:29] <jynus>	 !log setting up replication from pc2004 -> pc1004
[08:20:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:20:47] <jynus>	 _joe_: I don't think so, if it is a single transaction
[08:20:55] <jynus>	 everything will fail
[08:21:10] <jynus>	 but we have like 20 of those reports, many made by me
[08:21:37] <jynus>	 I guess it is low impact because it only impacts if you edit faster than innodb can handle, which is pretty fast
[08:22:44] <marostegui>	 So I guess only bots are really affected?
[08:23:34] <jynus>	 sorry, but I don't know, I only reported the issue, mw core people are aware, cannot do more
[08:23:58] <jynus>	 this is the only thing I know: https://phabricator.wikimedia.org/T202715#4637265
[08:26:12] <jynus>	 !log setting up replication from pc2005 -> pc1005 and from pc2006 -> pc1006
[08:26:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:30:31] <banyek>	 !log setting up some automated binlog purge mechanism on pc1004,pc1005,pc1006
[08:30:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:47:16] <wikibugs>	 (03CR) 10Volans: "I agree with the reasoning, it doesn't look too scary to me in our current production environment given the spread crontabs and the Icinga" [puppet] - 10https://gerrit.wikimedia.org/r/444230 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff)
[08:49:59] <wikibugs>	 10Operations, 10Discovery, 10Wikidata, 10Wikidata-Query-Service, and 2 others: Refactor 'use_git_deploy' in wdqs puppet module to cater for scap3 and autodeployment modes - https://phabricator.wikimedia.org/T206597 (10Mathew.onipe) 05Open>03Resolved
[08:51:10] <wikibugs>	 10Operations, 10ops-eqiad, 10Analytics, 10Patch-For-Review, 10User-Elukey: rack/setup/install stat1007.eqiad.wmnet (stat1005 user replacement) - https://phabricator.wikimedia.org/T203852 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['stat1007.eqiad.wmnet'] ```  and were **ALL** successful.
[08:51:55] <wikibugs>	 10Operations, 10Discovery, 10Wikidata, 10Wikidata-Query-Service, and 2 others: Refactor 'use_git_deploy' in wdqs puppet module to cater for scap3 and autodeployment modes - https://phabricator.wikimedia.org/T206597 (10Mathew.onipe) 05Resolved>03Open
[08:53:25] <wikibugs>	 (03PS1) 10Elukey: Add IPv6 PTR record for stat1007 [dns] - 10https://gerrit.wikimedia.org/r/466193 (https://phabricator.wikimedia.org/T203852)
[08:53:46] <wikibugs>	 (03CR) 10Elukey: [C: 032] Add IPv6 PTR record for stat1007 [dns] - 10https://gerrit.wikimedia.org/r/466193 (https://phabricator.wikimedia.org/T203852) (owner: 10Elukey)
[08:54:56] <wikibugs>	 10Operations, 10ops-eqiad, 10Analytics, 10Patch-For-Review, 10User-Elukey: rack/setup/install stat1007.eqiad.wmnet (stat1005 user replacement) - https://phabricator.wikimedia.org/T203852 (10elukey) a:05RobH>03elukey
[08:55:22] <wikibugs>	 10Operations, 10ops-eqiad, 10Analytics, 10Patch-For-Review, 10User-Elukey: rack/setup/install stat1007.eqiad.wmnet (stat1005 user replacement) - https://phabricator.wikimedia.org/T203852 (10elukey) 05Open>03Resolved Done! Will follow up in another task to replace stat1005 with this new host.
[09:16:10] <wikibugs>	 (03PS2) 10Mforns: Add druid_load jobs to analytics refinery [puppet] - 10https://gerrit.wikimedia.org/r/465692 (https://phabricator.wikimedia.org/T206342)
[09:19:18] <wikibugs>	 10Operations, 10DBA, 10MediaWiki-Database: parsercache used disk space increase - https://phabricator.wikimedia.org/T206740 (10Banyek) The binary logs were purged on pc1004. pc1005 pc1006. Also the binlog_max_size were set to 10M and the hosts now have this running in a screen:  ``` while true; do    echo "p...
[09:20:10] <wikibugs>	 (03CR) 10Mforns: "Thanks Andrew! Fixed a missing comma and Jenkins +2'd, I think this is ready. However, we still need to merge the refinery-source patch fir" [puppet] - 10https://gerrit.wikimedia.org/r/465692 (https://phabricator.wikimedia.org/T206342) (owner: 10Mforns)
[09:26:39] <wikibugs>	 10Operations, 10DBA, 10MediaWiki-Database: parsercache used disk space increase - https://phabricator.wikimedia.org/T206740 (10Banyek)
[09:27:05] <wikibugs>	 10Operations, 10DBA, 10MediaWiki-Database: parsercache used disk space increase - https://phabricator.wikimedia.org/T206740 (10Banyek)
[09:27:29] <wikibugs>	 10Operations, 10DBA, 10MediaWiki-Database: parsercache used disk space increase - https://phabricator.wikimedia.org/T206740 (10Banyek)
[09:28:31] <wikibugs>	 (03PS1) 10Mathew.onipe: wdqs: change user for autodeploy cron task [puppet] - 10https://gerrit.wikimedia.org/r/466209 (https://phabricator.wikimedia.org/T206597)
[09:33:45] <wikibugs>	 (03PS1) 10Muehlenhoff: Cleanup package status during Diamond removal [puppet] - 10https://gerrit.wikimedia.org/r/466217 (https://phabricator.wikimedia.org/T183454)
[09:34:41] <wikibugs>	 10Operations, 10DBA, 10MediaWiki-Database: parsercache used disk space increase - https://phabricator.wikimedia.org/T206740 (10akosiaris)
[09:34:57] <wikibugs>	 10Operations, 10monitoring, 10Patch-For-Review, 10User-fgiunchedi: Deprovision Diamond collectors no longer in use - https://phabricator.wikimedia.org/T183454 (10MoritzMuehlenhoff) There's a bug when removing Diamond via the diamond::remove option: The diamond.service remains in a failed state, as the serv...
[09:39:30] <wikibugs>	 (03PS2) 10Muehlenhoff: Cleanup systemd state on Diamond removal [puppet] - 10https://gerrit.wikimedia.org/r/466217 (https://phabricator.wikimedia.org/T183454)
[09:43:55] <wikibugs>	 10Operations, 10DBA, 10MediaWiki-General-or-Unknown, 10Wikidata, and 4 others: Investigate decrease in wikidata dispatch times due to eqiad -> codfw DC switch - https://phabricator.wikimedia.org/T205865 (10hoo) Thanks @Marostegui but this sadly didn't help. Do you have any other ideas what could cause thes...
[09:44:03] <wikibugs>	 10Operations, 10Traffic: Separate Traffic layer caches for PHP7/HHVM - https://phabricator.wikimedia.org/T206339 (10ema) p:05Triage>03Normal
[09:45:36] <wikibugs>	 10Operations, 10Domains, 10Traffic: SOA serial numbers returned by authoritative nameservers differ - https://phabricator.wikimedia.org/T206688 (10ema) p:05Triage>03Normal
[09:47:08] <wikibugs>	 10Operations, 10Traffic: certcentral: delay deployment of renewed certs to wait out skewed client clocks - https://phabricator.wikimedia.org/T204997 (10ema) p:05Triage>03Normal
[09:47:35] <wikibugs>	 10Operations, 10Traffic: Puppetise OCSP stapling for all one-off HTTPS servers - https://phabricator.wikimedia.org/T204992 (10ema) p:05Triage>03Normal
[09:47:56] <wikibugs>	 10Operations, 10Traffic: Consider adding Must-Staple header to enforce revocation checking - https://phabricator.wikimedia.org/T204987 (10ema) p:05Triage>03Normal
[09:48:49] <wikibugs>	 10Operations, 10DBA, 10MediaWiki-General-or-Unknown, 10Wikidata, and 4 others: Investigate decrease in wikidata dispatch times due to eqiad -> codfw DC switch - https://phabricator.wikimedia.org/T205865 (10hoo) ```wikiadmin@db1109(wikidatawiki)> SELECT * FROM information_schema.tables WHERE table_name = 'w...
[09:55:41] <wikibugs>	 (03CR) 10Gehel: [C: 032] wdqs: change user for autodeploy cron task [puppet] - 10https://gerrit.wikimedia.org/r/466209 (https://phabricator.wikimedia.org/T206597) (owner: 10Mathew.onipe)
[10:00:01] <wikibugs>	 (03PS1) 10Ladsgroup: Set some small wikis to read new for change tag backend [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466271 (https://phabricator.wikimedia.org/T194164)
[10:05:43] <wikibugs>	 (03PS2) 10Muehlenhoff: Remove obsolete mediawiki-firejail-rsvg-convert [puppet] - 10https://gerrit.wikimedia.org/r/465590
[10:06:06] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: First draft of a zotero helm chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/466287 (https://phabricator.wikimedia.org/T201611)
[10:07:57] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 032] Remove obsolete mediawiki-firejail-rsvg-convert [puppet] - 10https://gerrit.wikimedia.org/r/465590 (owner: 10Muehlenhoff)
[10:08:17] <wikibugs>	 (03PS4) 10Mathew.onipe: base::monitoring::host: added prometheus check for network drops [puppet] - 10https://gerrit.wikimedia.org/r/465450 (https://phabricator.wikimedia.org/T206114)
[10:11:00] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 04-1] "While well intended, there are some technicalities that make the two forms not compatible. This will probably require some string mangling to " (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/465519 (https://phabricator.wikimedia.org/T202782) (owner: 10Dzahn)
[10:14:19] <wikibugs>	 (03CR) 10Muehlenhoff: base/icinga: use MONITORING_HOSTS constant as NRPE allowed_hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/465519 (https://phabricator.wikimedia.org/T202782) (owner: 10Dzahn)
[10:15:54] <wikibugs>	 (03CR) 10Mathew.onipe: base::monitoring::host: added prometheus check for network drops (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/465450 (https://phabricator.wikimedia.org/T206114) (owner: 10Mathew.onipe)
[10:22:55] <wikibugs>	 (03PS7) 10Alex Monk: Certcentral-authdns integration [puppet] - 10https://gerrit.wikimedia.org/r/459809
[10:23:42] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Certcentral-authdns integration [puppet] - 10https://gerrit.wikimedia.org/r/459809 (owner: 10Alex Monk)
[10:26:40] <wikibugs>	 (03PS1) 10ArielGlenn: use lbzip2 for recombining page content dumps, if available and configured [dumps] - 10https://gerrit.wikimedia.org/r/466344 (https://phabricator.wikimedia.org/T179059)
[10:28:56] <wikibugs>	 (03PS2) 10Volans: Tests: refactor puppetdb tests with parametrize [software/cumin] - 10https://gerrit.wikimedia.org/r/465611
[10:28:58] <wikibugs>	 (03PS2) 10Volans: PuppetDB: fix regex matching [software/cumin] - 10https://gerrit.wikimedia.org/r/465612
[10:31:57] <wikibugs>	 (03CR) 10Mforns: "When I use --deploy-mode with this job, it fails with:" [puppet] - 10https://gerrit.wikimedia.org/r/465692 (https://phabricator.wikimedia.org/T206342) (owner: 10Mforns)
[10:32:20] <wikibugs>	 (03CR) 10Mforns: "I meant --deploy-mode cluster" [puppet] - 10https://gerrit.wikimedia.org/r/465692 (https://phabricator.wikimedia.org/T206342) (owner: 10Mforns)
[10:33:30] <wikibugs>	 (03CR) 10Volans: "> Patch Set 1:" [software/cumin] - 10https://gerrit.wikimedia.org/r/465611 (owner: 10Volans)
[10:55:37] <wikibugs>	 (03PS8) 10Alex Monk: Certcentral-authdns integration [puppet] - 10https://gerrit.wikimedia.org/r/459809
[11:00:04] <jouncebot>	 addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for European Mid-day SWAT(Max 6 patches) . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181011T1100).
[11:00:04] <jouncebot>	 Amir1: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[11:00:23] <zeljkof>	 o/
[11:00:39] <zeljkof>	 I'm around but I guess Amir1 will deploy his changes
[11:00:41] <marostegui>	 zeljkof: I have to deploy a patch, we are under an emergency
[11:01:23] <zeljkof>	 marostegui: swat on hold until you give us green light?
[11:01:30] <marostegui>	 thanks should not take long
[11:01:55] <Amir1>	 o/
[11:01:56] <zeljkof>	 Amir1: please wait for marostegui to finish
[11:02:05] <Amir1>	 sure
[11:02:55] <wikibugs>	 (03PS1) 10Marostegui: db-eqiad,db-codfw: Depool db2085:3318, db1099:3318 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466431
[11:03:01] <marostegui>	 banyek: ^
[11:04:49] <icinga-wm>	 RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 34 probes of 320 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[11:04:53] <wikibugs>	 (03CR) 10Marostegui: [C: 032] db-eqiad,db-codfw: Depool db2085:3318, db1099:3318 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466431 (owner: 10Marostegui)
[11:05:35] <wikibugs>	 (03PS9) 10Alex Monk: Certcentral-authdns integration [puppet] - 10https://gerrit.wikimedia.org/r/459809
[11:06:15] <wikibugs>	 (03Merged) 10jenkins-bot: db-eqiad,db-codfw: Depool db2085:3318, db1099:3318 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466431 (owner: 10Marostegui)
[11:07:08] <banyek>	 !log binlog expiration set to 60 days on db2045
[11:07:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:07:36] <logmsgbot>	 !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db2085:3318 and db1099:3318 (duration: 00m 49s)
[11:07:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:08:31] <marostegui>	 zeljkof: I am done - thanks!
[11:08:32] <logmsgbot>	 !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Depool db2085:3318 and db1099:3318 (duration: 00m 49s)
[11:08:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:09:41] <zeljkof>	 marostegui: thanks!
[11:09:45] <zeljkof>	 Amir1: swat is yours
[11:09:49] <marostegui>	 !log Stop MYSQL on db2088:3318 and db1099:3318 T206743
[11:09:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:09:52] <stashbot>	 T206743: S8 replication issues leading to rows missing during eqiad -> codfw switch (Was: "A few lexemes disappeared") - https://phabricator.wikimedia.org/T206743
[11:10:00] <Amir1>	 on it
[11:10:24] <marostegui>	 !log Stop MYSQL on db2085:3318 and db1099:3318 T206743
[11:10:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:12:13] <wikibugs>	 (03CR) 10jenkins-bot: db-eqiad,db-codfw: Depool db2085:3318, db1099:3318 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466431 (owner: 10Marostegui)
[11:12:26] <wikibugs>	 (03CR) 10Ladsgroup: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466271 (https://phabricator.wikimedia.org/T194164) (owner: 10Ladsgroup)
[11:13:47] <wikibugs>	 (03Merged) 10jenkins-bot: Set some small wikis to read new for change tag backend [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466271 (https://phabricator.wikimedia.org/T194164) (owner: 10Ladsgroup)
[11:14:23] <Amir1>	 live on mwdebug1002
[11:16:39] <Amir1>	 I made a mistake, let me fix it ASAP
[11:17:08] <icinga-wm>	 PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 36 probes of 320 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[11:18:29] <wikibugs>	 (03PS1) 10Ladsgroup: Fix constants [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466478
[11:18:51] <wikibugs>	 (03CR) 10Ladsgroup: [C: 032] Fix constants [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466478 (owner: 10Ladsgroup)
[11:20:27] <wikibugs>	 (03Merged) 10jenkins-bot: Fix constants [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466478 (owner: 10Ladsgroup)
[11:23:17] <Amir1>	 zeljkof: ignore "Notice: Use of undefined constant MIGRATION_READ_NEW" in fatal monitor, it's fixed now
[11:23:58] * zeljkof 's hair is on fire ;)
[11:25:43] <addshore>	 \o
[11:25:45] * addshore has one to squeeze into swat if there will be time
[11:26:07] <Amir1>	 it was mwdebug only :D
[11:26:19] * addshore hands zeljkof a glass of water for his hair
[11:27:13] <zeljkof>	 :D
[11:27:22] <addshore>	 Amir1: if you have time https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/Wikibase/+/466031/ :D
[11:27:24] <wikibugs>	 (03CR) 10jenkins-bot: Set some small wikis to read new for change tag backend [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466271 (https://phabricator.wikimedia.org/T194164) (owner: 10Ladsgroup)
[11:27:26] <wikibugs>	 (03CR) 10jenkins-bot: Fix constants [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466478 (owner: 10Ladsgroup)
[11:27:28] <zeljkof>	 addshore: it's just you and Amir1 
[11:27:32] <addshore>	 oh, jenkins said no anyway...
[11:27:44] <Amir1>	 why not :P
[11:28:41] <addshore>	 *fixed*
[11:29:08] <Amir1>	 logs seem clean, moving forward
[11:30:41] <logmsgbot>	 !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:466271|Set some small wikis to read new for change tag backend (T194164)]] (duration: 00m 50s)
[11:30:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:30:44] <stashbot>	 T194164: Start reading from change_tag_def in production - https://phabricator.wikimedia.org/T194164
[11:33:33] <Amir1>	 addshore: it's not merged on master yet
[11:33:42] <addshore>	 nope, you want to? :P
[11:33:49] <addshore>	 it's already running on mwmaint1002 ;)
[11:35:04] <Amir1>	 I have one nitpick for that :P
[11:35:07] <Amir1>	 addshore: ^
[11:35:14] <addshore>	 *looks*
[11:35:42] <addshore>	 Amir1: will do as a followup
[11:35:47] <addshore>	 and will do it in the other places then too
[11:36:21] <Amir1>	 sounds good
[11:38:08] <Amir1>	 I'll wait for jenkins and then merge
[11:38:13] <Amir1>	 (The cherry-pick)
[11:40:18] <icinga-wm>	 PROBLEM - DPKG on ores1001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[11:41:40] <addshore>	 Amir1: thanks!
[11:42:28] <icinga-wm>	 RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 33 probes of 320 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[11:44:45] <addshore>	 bah, Amir1, just fixed 1 more phpcs issue
[11:48:05] <Amir1>	 addshore: ping me when jenkins is happy :D
[11:48:19] <addshore>	 Amir1: will do
[11:49:49] <icinga-wm>	 PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 66 probes of 320 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[11:52:37] <moritzm>	 ^ akosiaris: the removal of ltrace triggered a broken dpkg state for some of the python dbg packages
[11:54:28] <wikibugs>	 (03PS1) 10Ema: ATS: define Icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/466483 (https://phabricator.wikimedia.org/T204209)
[12:06:04] <wikibugs>	 (03PS1) 10Ema: profile::cache::kafka: fix varnishkafka Icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/466516
[12:06:57] <wikibugs>	 (03PS1) 10Elukey: Release version 0.4.1+git20181010.2fa99eb [debs/prometheus-memcached-exporter] - 10https://gerrit.wikimedia.org/r/466519
[12:07:34] <wikibugs>	 (03CR) 10Elukey: [C: 032] Release version 0.4.1+git20181010.2fa99eb [debs/prometheus-memcached-exporter] - 10https://gerrit.wikimedia.org/r/466519 (owner: 10Elukey)
[12:08:25] <wikibugs>	 (03PS2) 10ArielGlenn: use lbzip2 for recombining page content dumps, if available and configured [dumps] - 10https://gerrit.wikimedia.org/r/466344 (https://phabricator.wikimedia.org/T179059)
[12:08:32] <wikibugs>	 (03CR) 10Elukey: [C: 031] "<3" [puppet] - 10https://gerrit.wikimedia.org/r/466516 (owner: 10Ema)
[12:10:35] <addshore>	 Amir1: not green yet, jenkins is taking an age
[12:10:43] * addshore is stepping out for 15 mins
[12:11:05] <wikibugs>	 (03CR) 10Ema: [C: 032] profile::cache::kafka: fix varnishkafka Icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/466516 (owner: 10Ema)
[12:12:27] <wikibugs>	 (03PS3) 10ArielGlenn: use lbzip2 for recombining page content dumps, if available and configured [dumps] - 10https://gerrit.wikimedia.org/r/466344 (https://phabricator.wikimedia.org/T179059)
[12:12:49] <Amir1>	 addshore: I close the SWAT, let's do it later
[12:12:55] <Amir1>	 !log EU SWAT is done
[12:12:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:13:55] <wikibugs>	 (03PS1) 10BBlack: move role::authdns::data::nameservers to common hiera [puppet] - 10https://gerrit.wikimedia.org/r/466539
[12:14:22] <elukey>	 !log upload prometheus-memcached-exporter_0.4.1+git20181010.2fa99eb-1 to (jessie|stretch)-wikimedia
[12:14:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:14:38] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] move role::authdns::data::nameservers to common hiera [puppet] - 10https://gerrit.wikimedia.org/r/466539 (owner: 10BBlack)
[12:14:43] <elukey>	 this has been already tested in deployment-prep, contains new metrics --^
[12:15:11] <elukey>	 !log upgrade prometheus-memcached-exporter on mc2035
[12:15:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:15:48] <wikibugs>	 (03PS2) 10Ema: ATS: define Icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/466483 (https://phabricator.wikimedia.org/T204209)
[12:20:42] <wikibugs>	 (03PS1) 10Mathew.onipe: wdqs: fixed autodeploy cron logging permission issue [puppet] - 10https://gerrit.wikimedia.org/r/466550 (https://phabricator.wikimedia.org/T206597)
[12:25:50] <wikibugs>	 (03PS1) 10ArielGlenn: dumps config settings to use lbzip2 for recombining page content files [puppet] - 10https://gerrit.wikimedia.org/r/466554 (https://phabricator.wikimedia.org/T179059)
[12:26:27] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] dumps config settings to use lbzip2 for recombining page content files [puppet] - 10https://gerrit.wikimedia.org/r/466554 (https://phabricator.wikimedia.org/T179059) (owner: 10ArielGlenn)
[12:27:35] <wikibugs>	 (03PS2) 10BBlack: move role::authdns::data::nameservers to common hiera [puppet] - 10https://gerrit.wikimedia.org/r/466539
[12:27:54] <wikibugs>	 (03CR) 10Ema: [C: 032] ATS: define Icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/466483 (https://phabricator.wikimedia.org/T204209) (owner: 10Ema)
[12:28:13] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] move role::authdns::data::nameservers to common hiera [puppet] - 10https://gerrit.wikimedia.org/r/466539 (owner: 10BBlack)
[12:29:25] <wikibugs>	 (03CR) 10BBlack: [V: 032 C: 032] "Jenkins is complaining about style there's no reasonable fix for.  This verifies as NO-OP on the authdns servers: https://puppet-compiler." [puppet] - 10https://gerrit.wikimedia.org/r/466539 (owner: 10BBlack)
[12:29:43] <wikibugs>	 (03PS3) 10BBlack: move role::authdns::data::nameservers to common hiera [puppet] - 10https://gerrit.wikimedia.org/r/466539
[12:29:53] <wikibugs>	 (03CR) 10BBlack: [V: 032 C: 032] move role::authdns::data::nameservers to common hiera [puppet] - 10https://gerrit.wikimedia.org/r/466539 (owner: 10BBlack)
[12:32:32] <wikibugs>	 (03PS2) 10ArielGlenn: dumps config settings to use lbzip2 for recombining page content files [puppet] - 10https://gerrit.wikimedia.org/r/466554 (https://phabricator.wikimedia.org/T179059)
[12:35:21] <wikibugs>	 (03PS1) 10Mathew.onipe: profile::maps::osm_master: change osmupdater and osmimporter auth method to peer [puppet] - 10https://gerrit.wikimedia.org/r/466574 (https://phabricator.wikimedia.org/T206639)
[12:38:40] <elukey>	 !log upgrade prometheus-memcached-exporter on mc2*
[12:38:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:42:34] <wikibugs>	 (03CR) 10Gehel: "good start, did you check that all the clients are connecting in a way that is compatible?" [puppet] - 10https://gerrit.wikimedia.org/r/466574 (https://phabricator.wikimedia.org/T206639) (owner: 10Mathew.onipe)
[12:43:26] <elukey>	 !log upgrade prometheus-memcached-exporter on mc1*
[12:43:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:45:19] <wikibugs>	 (03CR) 10Gehel: [C: 04-1] "Minor comment inline. I'm pretty sure that puppet compiler will show you an issue." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/466550 (https://phabricator.wikimedia.org/T206597) (owner: 10Mathew.onipe)
[12:46:11] <wikibugs>	 (03CR) 10Gehel: [C: 031] "Readable enough, LGTM" [software/cumin] - 10https://gerrit.wikimedia.org/r/465611 (owner: 10Volans)
[12:47:44] <wikibugs>	 (03CR) 10Volans: [C: 032] Tests: refactor puppetdb tests with parametrize [software/cumin] - 10https://gerrit.wikimedia.org/r/465611 (owner: 10Volans)
[12:48:39] <icinga-wm>	 PROBLEM - HHVM rendering on mw1285 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[12:49:38] <icinga-wm>	 RECOVERY - HHVM rendering on mw1285 is OK: HTTP OK: HTTP/1.1 200 OK - 74827 bytes in 0.127 second response time
[12:51:16] <wikibugs>	 (03Merged) 10jenkins-bot: Tests: refactor puppetdb tests with parametrize [software/cumin] - 10https://gerrit.wikimedia.org/r/465611 (owner: 10Volans)
[12:52:53] <wikibugs>	 (03CR) 10jenkins-bot: Tests: refactor puppetdb tests with parametrize [software/cumin] - 10https://gerrit.wikimedia.org/r/465611 (owner: 10Volans)
[13:01:12] <wikibugs>	 (03CR) 10Imarlier: "Ping @bblack @ema" [puppet] - 10https://gerrit.wikimedia.org/r/465538 (https://phabricator.wikimedia.org/T206496) (owner: 10Imarlier)
[13:02:28] <wikibugs>	 (03PS1) 10Anomie: Fix wgActorTableSchemaMigrationStage on Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466590 (https://phabricator.wikimedia.org/T206732)
[13:03:22] <wikibugs>	 (03CR) 10Anomie: "Deploying Beta Cluster config change" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466590 (https://phabricator.wikimedia.org/T206732) (owner: 10Anomie)
[13:03:35] <wikibugs>	 (03CR) 10Anomie: [C: 032] Fix wgActorTableSchemaMigrationStage on Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466590 (https://phabricator.wikimedia.org/T206732) (owner: 10Anomie)
[13:05:20] <wikibugs>	 (03Merged) 10jenkins-bot: Fix wgActorTableSchemaMigrationStage on Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466590 (https://phabricator.wikimedia.org/T206732) (owner: 10Anomie)
[13:06:20] <wikibugs>	 (03PS1) 10Gehel: relforge: setup 2 instances to validate multi-instance configuration [puppet] - 10https://gerrit.wikimedia.org/r/466591 (https://phabricator.wikimedia.org/T198352)
[13:09:21] <wikibugs>	 (03CR) 10jenkins-bot: Fix wgActorTableSchemaMigrationStage on Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466590 (https://phabricator.wikimedia.org/T206732) (owner: 10Anomie)
[13:11:47] <wikibugs>	 (03PS2) 10Gehel: relforge: setup 2 instances to validate multi-instance configuration [puppet] - 10https://gerrit.wikimedia.org/r/466591 (https://phabricator.wikimedia.org/T198352)
[13:12:40] <wikibugs>	 (03PS1) 10Marostegui: db-codfw.php: Depool db2083 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466594 (https://phabricator.wikimedia.org/T206743)
[13:14:09] <marostegui>	 banyek ^
[13:14:37] <wikibugs>	 (03CR) 10Banyek: [C: 031] db-codfw.php: Depool db2083 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466594 (https://phabricator.wikimedia.org/T206743) (owner: 10Marostegui)
[13:15:27] <icinga-wm>	 PROBLEM - Device not healthy -SMART- on db2050 is CRITICAL: cluster=mysql device=cciss,6 instance=db2050:9100 job=node site=codfw https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db2050&var-datasource=codfw%2520prometheus%252Fops
[13:17:48] <marostegui>	 banyek: can you just ack that alert? ^
[13:18:07] <marostegui>	 don't even create a ticket for it, let's let the disk fail by itself
[13:18:46] <wikibugs>	 (03CR) 10Marostegui: [C: 032] db-codfw.php: Depool db2083 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466594 (https://phabricator.wikimedia.org/T206743) (owner: 10Marostegui)
[13:19:00] <wikibugs>	 (03Abandoned) 10Mathew.onipe: wdqs: fixed autodeploy cron logging permission issue [puppet] - 10https://gerrit.wikimedia.org/r/466550 (https://phabricator.wikimedia.org/T206597) (owner: 10Mathew.onipe)
[13:19:53] <wikibugs>	 (03PS3) 10Gehel: relforge: setup 2 instances to validate multi-instance configuration [puppet] - 10https://gerrit.wikimedia.org/r/466591 (https://phabricator.wikimedia.org/T198352)
[13:20:24] <marostegui>	 !log Stop MySQL on db1116:3318 to reclone it from db2083 - T206743
[13:20:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:20:28] <stashbot>	 T206743: S8 replication issues leading to rows missing during eqiad -> codfw switch (Was: "A few lexemes disappeared") - https://phabricator.wikimedia.org/T206743
[13:20:28] <akosiaris>	 moritzm: yeah I know. I am gdbing 
[13:20:36] <wikibugs>	 (03Merged) 10jenkins-bot: db-codfw.php: Depool db2083 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466594 (https://phabricator.wikimedia.org/T206743) (owner: 10Marostegui)
[13:20:44] <akosiaris>	 unsuccessfully up to now
[13:21:05] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] relforge: setup 2 instances to validate multi-instance configuration [puppet] - 10https://gerrit.wikimedia.org/r/466591 (https://phabricator.wikimedia.org/T198352) (owner: 10Gehel)
[13:21:22] <akosiaris>	 seems like mwparserfromhell is not honoring what gdb python scripts are expecting and e.g. py-bt doesn't work
[13:21:32] <wikibugs>	 (03PS4) 10Gehel: relforge: setup 2 instances to validate multi-instance configuration [puppet] - 10https://gerrit.wikimedia.org/r/466591 (https://phabricator.wikimedia.org/T198352)
[13:21:38] <banyek>	 ok
[13:21:42] <akosiaris>	 that damn backtrace ranges from 220-307 frames
[13:21:46] <logmsgbot>	 !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Depool db2083 (duration: 00m 49s)
[13:21:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:22:00] <akosiaris>	 at least given what I've seen up to now
[13:22:27] <icinga-wm>	 ACKNOWLEDGEMENT - Device not healthy -SMART- on db2050 is CRITICAL: cluster=mysql device=cciss,6 instance=db2050:9100 job=node site=codfw Banyek ack https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db2050&var-datasource=codfw%2520prometheus%252Fops
[13:22:45] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] relforge: setup 2 instances to validate multi-instance configuration [puppet] - 10https://gerrit.wikimedia.org/r/466591 (https://phabricator.wikimedia.org/T198352) (owner: 10Gehel)
[13:23:44] <marostegui>	 !log Stop MySQL on db2083 to reclone db1116:3318 - T206743
[13:23:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:23:48] <wikibugs>	 10Operations, 10cloud-services-team: Create a jessie netboot image with the 4.9 Linux kernel - https://phabricator.wikimedia.org/T206761 (10MoritzMuehlenhoff)
[13:25:01] <wikibugs>	 (03CR) 10jenkins-bot: db-codfw.php: Depool db2083 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466594 (https://phabricator.wikimedia.org/T206743) (owner: 10Marostegui)
[13:26:33] <jynus>	 !log filling in missing rows on dbstore1002
[13:26:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:29:04] <wikibugs>	 (03PS1) 10Ema: varnish: add vtc test for sitemap rewrites [puppet] - 10https://gerrit.wikimedia.org/r/466602 (https://phabricator.wikimedia.org/T206496)
[13:29:58] <wikibugs>	 (03PS1) 10Mathew.onipe: wdqs: fixed autodeploy cron logging permission issue [puppet] - 10https://gerrit.wikimedia.org/r/466603 (https://phabricator.wikimedia.org/T206597)
[13:31:44] <wikibugs>	 (03Abandoned) 10Mathew.onipe: wdqs: fixed autodeploy cron logging permission issue [puppet] - 10https://gerrit.wikimedia.org/r/466603 (https://phabricator.wikimedia.org/T206597) (owner: 10Mathew.onipe)
[13:32:18] <wikibugs>	 (03CR) 10Ema: [C: 032] varnish: add vtc test for sitemap rewrites [puppet] - 10https://gerrit.wikimedia.org/r/466602 (https://phabricator.wikimedia.org/T206496) (owner: 10Ema)
[13:36:18] <icinga-wm>	 RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 35 probes of 319 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[13:40:07] <icinga-wm>	 PROBLEM - puppet last run on ms-be1014 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[13:42:42] <wikibugs>	 (03PS1) 10Mathew.onipe: wdqs: fixed autodeploy cron logging permission issue [puppet] - 10https://gerrit.wikimedia.org/r/466608 (https://phabricator.wikimedia.org/T206597)
[13:43:38] <icinga-wm>	 PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 68 probes of 319 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[13:48:28] <wikibugs>	 (03CR) 10Mathew.onipe: "Puppet compiler:" [puppet] - 10https://gerrit.wikimedia.org/r/466608 (https://phabricator.wikimedia.org/T206597) (owner: 10Mathew.onipe)
[13:53:35] <wikibugs>	 (03PS1) 10Mathew.onipe: wdqs: add logging message to autodeploy cron task [puppet] - 10https://gerrit.wikimedia.org/r/466610 (https://phabricator.wikimedia.org/T206597)
[13:55:33] <jynus>	 !log recovering rows to db1092
[13:55:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:05:27] <icinga-wm>	 RECOVERY - puppet last run on ms-be1014 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures
[14:07:34] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 031] cache::upload: Move swift to active/active [puppet] - 10https://gerrit.wikimedia.org/r/458796 (https://phabricator.wikimedia.org/T203777) (owner: 10Alexandros Kosiaris)
[14:08:09] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 031] "+1 as long as we move to a/a ASAP after the switchback is done." [puppet] - 10https://gerrit.wikimedia.org/r/458797 (https://phabricator.wikimedia.org/T203777) (owner: 10Alexandros Kosiaris)
[14:08:58] <icinga-wm>	 RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 31 probes of 319 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[14:09:22] <wikibugs>	 (03CR) 10Faidon Liambotis: [C: 04-1] smarthost: create mail smarthost role/profile (039 comments) [puppet] - 10https://gerrit.wikimedia.org/r/463143 (https://phabricator.wikimedia.org/T41785) (owner: 10Herron)
[14:11:47] <icinga-wm>	 PROBLEM - Nginx local proxy to apache on mw1339 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.006 second response time
[14:11:57] <icinga-wm>	 PROBLEM - HHVM rendering on mw1339 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.001 second response time
[14:12:57] <icinga-wm>	 RECOVERY - Nginx local proxy to apache on mw1339 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.069 second response time
[14:13:07] <icinga-wm>	 RECOVERY - HHVM rendering on mw1339 is OK: HTTP OK: HTTP/1.1 200 OK - 74827 bytes in 0.151 second response time
[14:13:22] <moritzm>	 !log install libxml2 security updates on jessie servers
[14:13:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:15:51] <jynus>	 !log applying row filling to (most) eqiad s8 dbs, including the master
[14:15:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:15:58] <elukey>	 !log reboot eventlog1002 for kernel upgrades
[14:15:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:16:08] <icinga-wm>	 PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 64 probes of 319 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[14:22:27] <wikibugs>	 (03PS1) 10Banyek: mariadb: depool db1087 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466618
[14:23:47] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] mariadb: depool db1087 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466618 (owner: 10Banyek)
[14:24:03] <wikibugs>	 10Operations, 10DBA, 10MediaWiki-Database: parsercache used disk space increase - https://phabricator.wikimedia.org/T206740 (10Marostegui) p:05Unbreak!>03High This is no longer "unbreak now" ``` pc1004 Filesystem                 Type  Size  Used Avail Use% Mounted on /dev/mapper/pc1004--vg-srv xfs   2.2T...
[14:24:42] <akosiaris>	 FYI, T-6m for services switchover
[14:25:38] <wikibugs>	 (03PS2) 10Banyek: mariadb: depool db1087 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466618 (https://phabricator.wikimedia.org/T206743)
[14:27:31] <wikibugs>	 (03CR) 10Jcrespo: [C: 032] mariadb: depool db1087 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466618 (https://phabricator.wikimedia.org/T206743) (owner: 10Banyek)
[14:27:47] <wikibugs>	 (03CR) 10Banyek: [V: 032] mariadb: depool db1087 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466618 (https://phabricator.wikimedia.org/T206743) (owner: 10Banyek)
[14:28:09] <banyek>	 !log depooling db1087
[14:28:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:28:17] <banyek>	 !log depooling db1087 (T206743)
[14:28:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:28:20] <stashbot>	 T206743: S8 replication issues leading to rows missing during eqiad -> codfw switch (Was: "A few lexemes disappeared") - https://phabricator.wikimedia.org/T206743
[14:28:39] <akosiaris>	 FYI, T-2m for services switchover
[14:29:24] <akosiaris>	 please cease all other operational activity for 10mins
[14:30:04] <jouncebot>	 Deploy window Datacenter Switchback - Services (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181011T1430)
[14:30:11] <akosiaris>	 starting
[14:30:13] <logmsgbot>	 !log START - Cookbook sre.switchdc.services.00-reduce-ttl-and-sleep (volans@neodymium)
[14:30:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:30:20] <mobrovac>	 akosiaris: only 10 mins? :)
[14:30:36] <volans>	 akosiaris: user stealer!
[14:30:39] <akosiaris>	 I am being optimistic 
[14:30:44] <logmsgbot>	 !log banyek@deploy1001 Synchronized wmf-config/db-eqiad.php: T206743: mariadb: Depool db1087 (duration: 00m 49s)
[14:30:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:30:51] <akosiaris>	 volans: indeed. I did not think of that
[14:31:00] <akosiaris>	 well anyway you get extra credits :P
[14:31:00] <_joe_>	 what are you doing guys?
[14:31:11] <volans>	 I'm doing nothing
[14:31:18] <volans>	 alex is using my tmux from yesterday :D
[14:31:21] <volans>	 with my $USER set
[14:31:35] <akosiaris>	 banyek: please pause for a while
[14:31:39] <_joe_>	 akosiaris: you can just run the script I pasted you
[14:31:52] <mark>	 better blame volans.jpg
[14:31:57] <volans>	 lol
[14:32:19] <akosiaris>	 _joe_: I'll do that afterwards with argument codfw
[14:32:28] <akosiaris>	 but for now I am running the cookbook as normal
[14:32:40] <_joe_>	 uhm ok, we need to remove aqs then
[14:33:01] <akosiaris>	 and restbase for now
[14:33:12] <akosiaris>	 and do it after swift is also switched back or tomorrow
[14:33:27] <_joe_>	 ok, your call :)
[14:34:24] <elukey>	 something about aqs? :)
[14:34:25] <akosiaris>	 this five mins wait for the TTL ...
[14:34:43] <akosiaris>	 _joe_: why was aqs in the list in the first place anyway ?
[14:35:20] <_joe_>	 I just didn't remove it, I noticed some minutes ago but thought it wouldn't count
[14:35:30] <logmsgbot>	 !log END (PASS) - Cookbook sre.switchdc.services.00-reduce-ttl-and-sleep (exit_code=0) (volans@neodymium)
[14:35:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:35:32] <_joe_>	 akosiaris: still blaming me?
[14:35:36] <_joe_>	 :P
[14:36:05] <logmsgbot>	 !log START - Cookbook sre.switchdc.services.01-switch-dc (volans@neodymium)
[14:36:05] <logmsgbot>	 !log Switching services parsoid, restbase, restbase-async, mobileapps, apertium, citoid, cxserver, eventstreams, graphoid, mathoid, proton, pdfrender, recommendation-api, zotero, eventbus, ores, wdqs, wdqs-internal: codfw => eqiad (volans@neodymium)
[14:36:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:36:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:36:24] <logmsgbot>	 !log END (PASS) - Cookbook sre.switchdc.services.01-switch-dc (exit_code=0) (volans@neodymium)
[14:36:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:36:49] <akosiaris>	 ok, checking before restoring TTL
[14:37:12] <wikibugs>	 10Operations, 10Gadgets, 10MediaWiki-Cache, 10Performance-Team (Radar), and 2 others: Mcrouter periodically reports soft TKOs for mc[1,2]035 leading to MW Memcached exceptions - https://phabricator.wikimedia.org/T203786 (10elukey) Finally build and deployed the new prometheus-memcached-exporter on the mc*...
[14:37:15] <_joe_>	 akosiaris: I think you missed kartotherian
[14:37:25] <_joe_>	 but it's not used, so meh
[14:37:31] <_joe_>	 the discovery endpoint I mean
[14:38:03] <akosiaris>	 yeah it's directly connected to from varnish
[14:38:08] <akosiaris>	 it was done yesterday 
[14:38:10] <akosiaris>	 in upload
[14:38:15] <_joe_>	 yup
[14:38:47] <mobrovac>	 ok looks good
[14:40:57] <_joe_>	 akosiaris: do you want to repool codfw now or later?
[14:41:03] <_joe_>	 everything is ok in etcd
[14:41:08] <akosiaris>	 _joe_: later, after swift is done
[14:41:19] <akosiaris>	 let's play it by the book
[14:41:41] <akosiaris>	 everything looks fine up to now
[14:41:47] <_joe_>	 you can restore the ttl, btw, the cookbook checks by itself
[14:41:50] <wikibugs>	 (03CR) 10jenkins-bot: mariadb: depool db1087 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466618 (https://phabricator.wikimedia.org/T206743) (owner: 10Banyek)
[14:41:54] <akosiaris>	 I'll do another round of checks and restore the TTL
[14:42:05] <_joe_>	 rb timeouts in codfw
[14:42:07] <_joe_>	 uhm
[14:42:14] <akosiaris>	 yup just noticed them
[14:42:19] <akosiaris>	 both eqiad+codfw
[14:42:34] <akosiaris>	 still at 1/3 checks 
[14:42:42] <_joe_>	 so most probably unrelated
[14:42:47] <_joe_>	 but ofc it happens now
[14:43:05] <mobrovac>	 looking
[14:43:19] <akosiaris>	 alert is /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received
[14:43:33] <akosiaris>	 2/3 for some hosts, but they are fewer now
[14:43:45] <mobrovac>	 ah ok
[14:43:48] <mobrovac>	 that's mcs
[14:44:18] <_joe_>	 yes
[14:44:33] <_joe_>	 and btw I ran it on a couple servers and it completes successfully now
[14:45:01] <akosiaris>	 I'll give it another minute just in case
[14:45:12] <mobrovac>	 yeah we really have to look into that one and why that is happening
[14:45:36] <mobrovac>	 tried on some hosts and looks to be back now
[14:45:59] <akosiaris>	 yeah it's flapping a bit
[14:46:09] <akosiaris>	 well.. icinga just now detects it
[14:46:18] <akosiaris>	 I am guessing things are fine though
[14:46:25] <akosiaris>	 we need something better than icinga soon
[14:46:37] <icinga-wm>	 RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 34 probes of 319 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[14:46:58] <akosiaris>	 ok all alerts gone, I am restoring the TTL
[14:47:03] <logmsgbot>	 !log START - Cookbook sre.switchdc.services.02-restore-ttl (volans@neodymium)
[14:47:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:47:23] <logmsgbot>	 !log END (PASS) - Cookbook sre.switchdc.services.02-restore-ttl (exit_code=0) (volans@neodymium)
[14:47:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:49:51] <wikibugs>	 (03PS1) 10Marostegui: db-eqiad.php: Repool db1099:3318 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466652 (https://phabricator.wikimedia.org/T206743)
[14:51:21] <akosiaris>	 T-10 for swift
[14:51:41] <wikibugs>	 (03CR) 10Marostegui: [C: 032] db-eqiad.php: Repool db1099:3318 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466652 (https://phabricator.wikimedia.org/T206743) (owner: 10Marostegui)
[14:52:25] <wikibugs>	 (03PS2) 10Alexandros Kosiaris: cache::upload: Move swift to active/active [puppet] - 10https://gerrit.wikimedia.org/r/458796 (https://phabricator.wikimedia.org/T203777)
[14:52:35] <jynus>	 !log deploying wikidata row fix to db1087 with replication enabled
[14:52:36] <volans>	 akosiaris: was the swiftrepl totally stopped or does it have to be switched manually too?
[14:52:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:52:41] <volans>	 I know it's only for reconciliation now
[14:52:54] <wikibugs>	 (03PS2) 10Alexandros Kosiaris: cache::upload: Move swift to eqiad [puppet] - 10https://gerrit.wikimedia.org/r/458797 (https://phabricator.wikimedia.org/T203777)
[14:53:09] <wikibugs>	 (03Merged) 10jenkins-bot: db-eqiad.php: Repool db1099:3318 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466652 (https://phabricator.wikimedia.org/T206743) (owner: 10Marostegui)
[14:53:48] <icinga-wm>	 PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 51 probes of 319 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[14:53:54] <akosiaris>	 volans: IIRC, no
[14:54:49] <wikibugs>	 (03PS1) 10Jcrespo: Revert "mariadb: depool db1087" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466653
[14:54:59] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Revert "mariadb: depool db1087" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466653 (owner: 10Jcrespo)
[14:55:00] <volans>	 akosiaris: so we just stopped it, trust the double write and alarm on the graph if the diff is too high
[14:55:20] <akosiaris>	 and use it to reconcile
[14:56:56] <logmsgbot>	 !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1099:3318 (duration: 00m 48s)
[14:56:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:57:01] <wikibugs>	 (03CR) 10jenkins-bot: db-eqiad.php: Repool db1099:3318 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466652 (https://phabricator.wikimedia.org/T206743) (owner: 10Marostegui)
[14:59:05] <akosiaris>	 T-2m for swift
[15:00:04] <jouncebot>	 Deploy window Datacenter Switchback - Media storage/Swift (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181011T1500)
[15:00:09] <akosiaris>	 starting swift switchback
[15:00:38] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 032] cache::upload: Move swift to active/active [puppet] - 10https://gerrit.wikimedia.org/r/458796 (https://phabricator.wikimedia.org/T203777) (owner: 10Alexandros Kosiaris)
[15:00:47] <icinga-wm>	 PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 98 probes of 319 (alerts on 35) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[15:01:10] <akosiaris>	 !log Media storage/Swift Swift set to active/active
[15:01:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:02:27] <wikibugs>	 (03PS1) 10Marostegui: db-eqiad.php: Depool db1101:3318 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466658 (https://phabricator.wikimedia.org/T206743)
[15:02:58] <akosiaris>	 puppet runs done
[15:03:11] <akosiaris>	 proceeding with moving swift to eqiad (and then undo it later on :))
[15:03:17] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 032] cache::upload: Move swift to eqiad [puppet] - 10https://gerrit.wikimedia.org/r/458797 (https://phabricator.wikimedia.org/T203777) (owner: 10Alexandros Kosiaris)
[15:03:35] <volans>	 akosiaris: export USER=akosiaris :-P
[15:03:41] <volans>	 sorry
[15:03:44] <volans>	 SUDO_USER
[15:03:44] <akosiaris>	 SUDO_USER :P
[15:04:30] <akosiaris>	 !log Media storage/Swift Swift set to active/passive
[15:04:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:04:45] <marostegui>	 akosiaris: I will wait for your green light to deploy that db-eqiad.php to depool a DB
[15:05:27] <icinga-wm>	 PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 42 probes of 319 (alerts on 35) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[15:06:48] <akosiaris>	 puppet done
[15:07:07] <akosiaris>	 graphs look normal
[15:07:34] <akosiaris>	 no alerts
[15:07:57] <akosiaris>	 marostegui: I think you can proceed
[15:08:09] <marostegui>	 ta!
[15:08:11] <akosiaris>	 everything looks fine 
[15:08:12] <wikibugs>	 (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1101:3318 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466658 (https://phabricator.wikimedia.org/T206743) (owner: 10Marostegui)
[15:09:55] <wikibugs>	 (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1101:3318 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466658 (https://phabricator.wikimedia.org/T206743) (owner: 10Marostegui)
[15:11:09] <logmsgbot>	 !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1101:3318 (duration: 00m 49s)
[15:11:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:12:15] <wikibugs>	 (03CR) 10jenkins-bot: db-eqiad.php: Depool db1101:3318 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466658 (https://phabricator.wikimedia.org/T206743) (owner: 10Marostegui)
[15:12:19] <wikibugs>	 10Operations, 10Goal, 10Patch-For-Review: Migrate the hardware inventory from Racktables to Netbox - https://phabricator.wikimedia.org/T199083 (10Volans) 05Open>03Resolved
[15:12:38] <wikibugs>	 10Operations, 10Tracking: Hardware Automation Workflow - Overall Tracking - https://phabricator.wikimedia.org/T116063 (10Volans)
[15:12:45] <wikibugs>	 10Operations, 10netops, 10Patch-For-Review: Evaluate NetBox as a Racktables replacement & IPAM - https://phabricator.wikimedia.org/T170144 (10Volans) 05Open>03Resolved
[15:12:50] <marostegui>	 !log Stop MySQL on db2085:3318 to reclone db1101:3318 - T206743
[15:12:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:12:54] <stashbot>	 T206743: S8 replication issues leading to rows missing during eqiad -> codfw switch (Was: "A few lexemes disappeared") - https://phabricator.wikimedia.org/T206743
[15:13:26] <wikibugs>	 (03CR) 10Imarlier: [C: 031] Enable base::service_auto_restart for uwsgi-coal [puppet] - 10https://gerrit.wikimedia.org/r/465593 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff)
[15:14:53] <wikibugs>	 10Operations, 10Release Pipeline (Blubber), 10Release-Engineering-Team (Watching / External): Update Debian package of Blubber (0.6.0-1) - https://phabricator.wikimedia.org/T206766 (10thcipriani)
[15:14:59] <apergos>	 what do folks think about the increased thumbor host latency? 
[15:15:17] <_joe_>	 apergos: can you explain better?
[15:15:24] <apergos>	 https://grafana.wikimedia.org/dashboard/db/thumbor?orgId=1&from=now-3h&to=now
[15:16:02] <_joe_>	 apergos: we're back to reading on swift eqiad
[15:16:14] <_joe_>	 which is "colder" I'd expect
[15:17:06] <apergos>	 I had some vague notion that, like file uploads, thumbs are synced; maybe that's wrong
[15:17:19] <_joe_>	 also looks like codfw performs better
[15:17:33] <_joe_>	 see https://grafana.wikimedia.org/dashboard/db/thumbor?orgId=1&from=1535375521844&to=1539270996453&panelId=5&fullscreen
[15:17:53] <_joe_>	 after the switchover, perf in codfw is apparently much better
[15:18:08] <_joe_>	 not sure what swift url is used by thumbor, though
[15:18:21] <apergos>	 that's interesting
[15:18:47] <icinga-wm>	 ACKNOWLEDGEMENT - DPKG on ores1001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages alexandros kosiaris gdbing celery to figure out why some workers are busy looping.
[15:19:47] <apergos>	 left to do is still restbase, is that right? anything else?
[15:19:54] <akosiaris>	 _joe_: en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) is CRITICAL: Test Retrieve aggregated feed content for April 29, 2016 responds with malformed body (AttributeError: 'NoneType' object has no attribute 'get'):
[15:20:00] <akosiaris>	 that looks like a bug in service-checker
[15:20:18] <akosiaris>	 probably needs an if blah: somewhere
[15:20:38] <akosiaris>	 apergos: what's left to do is repool all the active/active services
[15:20:47] <_joe_>	 akosiaris: no, it looks like we expected a json response and we got back an empty body
[15:20:50] <apergos>	 ah indeed
[15:20:53] <mobrovac>	 hm that means the body does not match the expected response
[15:21:11] <mobrovac>	 which host is that on?
[15:21:16] <_joe_>	 mobrovac: no, that the body is empty, typically
[15:21:18] <akosiaris>	 yeah but doing a get on NoneType is wrong 
[15:21:18] <_joe_>	 all of them
[15:21:22] <_joe_>	 no sorry
[15:21:29] <_joe_>	 on other hosts you get 504
[15:21:34] <akosiaris>	 check if it's None and possibly alert on that
[15:21:46] <akosiaris>	 multiple restbase hosts
[15:21:56] <akosiaris>	 restbase1013, restbase1015, restbase2007
[15:22:00] <_joe_>	 it's now just a couple
[15:22:02] <mobrovac>	 huh k
[15:22:11] <_joe_>	 but more of them were responding 504 to that same test
[15:23:10] <_joe_>	 it's all gone now
[15:23:23] <mobrovac>	 _joe_: looking at the output, there is a return body, but apparently a part of the expected hash obj is missing so we get that obscure error
[15:23:43] <mobrovac>	 heh indeed, it's healthy now
[15:23:46] <_joe_>	 mobrovac: ok can you write a bug? I might have to look at service-checker tomorrow
[15:23:53] <_joe_>	 anyways
[15:23:59] <mobrovac>	 btw _joe_, 504's and python errors were for different checks
[15:26:52] <wikibugs>	 (03PS2) 10Jcrespo: Revert "mariadb: depool db1087" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466653
[15:28:36] <apergos>	 there was a brief crit for thumbor.svc.eqiad but it went away before I could read it
[15:29:53] <wikibugs>	 (03PS3) 10Jcrespo: Revert "mariadb: depool db1087" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466653
[15:30:47] <icinga-wm>	 RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 18 probes of 319 (alerts on 35) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[15:31:17] <icinga-wm>	 RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 17 probes of 319 (alerts on 35) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[15:32:24] <wikibugs>	 (03CR) 10Jcrespo: [C: 032] Revert "mariadb: depool db1087" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466653 (owner: 10Jcrespo)
[15:32:29] <wikibugs>	 (03CR) 10Cwhite: "The reasoning appears sound to me." [puppet] - 10https://gerrit.wikimedia.org/r/444230 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff)
[15:33:21] <wikibugs>	 10Operations, 10Analytics, 10Analytics-Kanban, 10Traffic, and 2 others: Add Accept header to webrequest logs - https://phabricator.wikimedia.org/T170606 (10Ottomata)
[15:33:49] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "mariadb: depool db1087" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466653 (owner: 10Jcrespo)
[15:34:12] <wikibugs>	 (03CR) 10Cwhite: [C: 031] Cleanup systemd state on Diamond removal [puppet] - 10https://gerrit.wikimedia.org/r/466217 (https://phabricator.wikimedia.org/T183454) (owner: 10Muehlenhoff)
[15:38:34] <logmsgbot>	 !log jynus@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1087 (duration: 00m 50s)
[15:38:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:38:44] <wikibugs>	 10Operations, 10Goal, 10Patch-For-Review: Perform a datacenter switchover (2018-19 Q1) - https://phabricator.wikimedia.org/T199073 (10akosiaris)
[15:38:47] <wikibugs>	 10Operations, 10Goal, 10Patch-For-Review: Successfully switch backend traffic (MediaWiki, Swift, RESTBase, Parsoid and services) to be served from eqiad - https://phabricator.wikimedia.org/T203777 (10akosiaris) 05Open>03Resolved a:03akosiaris Mediawiki and traffic were successfully switched yesterday,...
[15:39:08] <wikibugs>	 10Operations, 10DBA, 10MediaWiki-Database, 10Datacenter-Switchover-2018: parsercache used disk space increase - https://phabricator.wikimedia.org/T206740 (10jcrespo) Forgetting the codfw -> eqiad replication was the most likely cause of overload on the application servers (and on External storage hosts).
[15:42:33] <wikibugs>	 (03CR) 10jenkins-bot: Revert "mariadb: depool db1087" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466653 (owner: 10Jcrespo)
[15:54:17] <wikibugs>	 (03PS3) 10Cwhite: hiera: enable ntp collector on role::recursor [puppet] - 10https://gerrit.wikimedia.org/r/464905 (https://phabricator.wikimedia.org/T183454)
[16:00:04] <jouncebot>	 godog and _joe_: (Dis)respected human, time to deploy Puppet SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181011T1600). Please do the needful.
[16:00:04] <jouncebot>	 No GERRIT patches in the queue for this window AFAICS.
[16:01:03] <wikibugs>	 (03CR) 10Volans: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/444230 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff)
[16:04:45] <wikibugs>	 (03PS2) 10Gehel: wdqs: add logging message to autodeploy cron task [puppet] - 10https://gerrit.wikimedia.org/r/466610 (https://phabricator.wikimedia.org/T206597) (owner: 10Mathew.onipe)
[16:06:24] <wikibugs>	 (03PS2) 10Gehel: wdqs: fixed autodeploy cron logging permission issue [puppet] - 10https://gerrit.wikimedia.org/r/466608 (https://phabricator.wikimedia.org/T206597) (owner: 10Mathew.onipe)
[16:06:58] <icinga-wm>	 RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 33 probes of 319 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[16:08:49] <wikibugs>	 (03CR) 10Gehel: [C: 032] wdqs: fixed autodeploy cron logging permission issue [puppet] - 10https://gerrit.wikimedia.org/r/466608 (https://phabricator.wikimedia.org/T206597) (owner: 10Mathew.onipe)
[16:10:37] <wikibugs>	 10Operations, 10Goal, 10Patch-For-Review: Perform a datacenter switchover (2018-19 Q1) - https://phabricator.wikimedia.org/T199073 (10akosiaris)
[16:10:38] <icinga-wm>	 PROBLEM - Nginx local proxy to apache on mw1348 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.006 second response time
[16:10:38] <icinga-wm>	 PROBLEM - Apache HTTP on mw1348 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.001 second response time
[16:10:48] <wikibugs>	 (03PS3) 10Gehel: wdqs: add logging message to autodeploy cron task [puppet] - 10https://gerrit.wikimedia.org/r/466610 (https://phabricator.wikimedia.org/T206597) (owner: 10Mathew.onipe)
[16:11:05] <wikibugs>	 (03CR) 10EBernhardson: "Overall looks good, not clear on why it was necessary to have some tlsproxy's not be a default_server" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/466591 (https://phabricator.wikimedia.org/T198352) (owner: 10Gehel)
[16:11:15] <wikibugs>	 (03CR) 10Gehel: [C: 032] wdqs: add logging message to autodeploy cron task [puppet] - 10https://gerrit.wikimedia.org/r/466610 (https://phabricator.wikimedia.org/T206597) (owner: 10Mathew.onipe)
[16:11:23] <wikibugs>	 (03CR) 10Cwhite: [C: 032] hiera: enable ntp collector on role::recursor [puppet] - 10https://gerrit.wikimedia.org/r/464905 (https://phabricator.wikimedia.org/T183454) (owner: 10Cwhite)
[16:11:25] <wikibugs>	 (03PS4) 10Cwhite: hiera: enable ntp collector on role::recursor [puppet] - 10https://gerrit.wikimedia.org/r/464905 (https://phabricator.wikimedia.org/T183454)
[16:11:38] <icinga-wm>	 RECOVERY - Nginx local proxy to apache on mw1348 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.029 second response time
[16:11:42] <wikibugs>	 10Operations, 10Goal, 10Patch-For-Review: Perform a datacenter switchover (2018-19 Q1) - https://phabricator.wikimedia.org/T199073 (10akosiaris) 05Open>03Resolved a:03akosiaris Successfully switched (with some aftermath and actionables but successfully nevertheless) to codfw and back per the subtasks,...
[16:11:47] <icinga-wm>	 RECOVERY - Apache HTTP on mw1348 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.031 second response time
[16:13:18] <wikibugs>	 (03CR) 10Gehel: "Note that this is still WIP and non working yet at all." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/466591 (https://phabricator.wikimedia.org/T198352) (owner: 10Gehel)
[16:13:58] <icinga-wm>	 PROBLEM - Check systemd state on dns1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[16:14:09] <bblack>	 ^ looking
[16:14:17] <icinga-wm>	 PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 63 probes of 319 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[16:15:04] <bblack>	 oh hmm
[16:15:08] <bblack>	 cwhite: ?? dns1001
[16:15:36] <bblack>	 wrong IRC name!
[16:16:27] <icinga-wm>	 PROBLEM - High lag on wdqs1003 is CRITICAL: 3632 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[16:16:28] <bblack>	 shdubsh: what's up on dns1001?
[16:16:55] <shdubsh>	 bblack: must be something with prometheus-node-exporter.  having a look
[16:17:28] <icinga-wm>	 PROBLEM - Check systemd state on dns4001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[16:18:02] <vgutierrez>	 ^^ dns4001 as well
[16:18:46] <wikibugs>	 (03PS1) 10Cwhite: Revert "hiera: enable ntp collector on role::recursor" [puppet] - 10https://gerrit.wikimedia.org/r/466686
[16:18:57] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb1004 is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) is CRITICAL: Test retrieve the most read articles for January 1, 2016 returned the unexpected status 504 (expecting: 200): /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) is CRITICAL: Test
[16:18:57] <icinga-wm>	 read articles for January 1, 2016 (with aggregated=true) returned the unexpected status 504 (expecting: 200)
[16:19:06] <bblack>	 yeah it seems like the ntp collector is faulty somehow
[16:19:13] <wikibugs>	 (03PS2) 10Cwhite: Revert "hiera: enable ntp collector on role::recursor" [puppet] - 10https://gerrit.wikimedia.org/r/466686
[16:19:21] <bblack>	 shouldn't break dns/ntp though
[16:19:24] <wikibugs>	 (03CR) 10Cwhite: [V: 032 C: 032] Revert "hiera: enable ntp collector on role::recursor" [puppet] - 10https://gerrit.wikimedia.org/r/466686 (owner: 10Cwhite)
[16:19:46] <shdubsh>	 right
[16:19:58] <icinga-wm>	 PROBLEM - Check systemd state on dns5001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[16:19:58] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb1004 is OK: All endpoints are healthy
[16:20:22] <shdubsh>	 reverting now
[16:20:28] <icinga-wm>	 PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on einsteinium is CRITICAL: 48.27 le 60 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[16:20:57] <icinga-wm>	 PROBLEM - Check systemd state on dns2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[16:21:07] <icinga-wm>	 PROBLEM - Check systemd state on dns5002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[16:21:28] <icinga-wm>	 RECOVERY - Check systemd state on dns1001 is OK: OK - running: The system is fully operational
[16:21:57] <icinga-wm>	 RECOVERY - Check systemd state on dns4001 is OK: OK - running: The system is fully operational
[16:23:08] <icinga-wm>	 RECOVERY - Check systemd state on dns2001 is OK: OK - running: The system is fully operational
[16:23:18] <icinga-wm>	 RECOVERY - Check systemd state on dns5002 is OK: OK - running: The system is fully operational
[16:24:23] <wikibugs>	 (03PS1) 10Gehel: logrotate: add type contraints to parameters [puppet] - 10https://gerrit.wikimedia.org/r/466687
[16:24:48] <icinga-wm>	 RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on einsteinium is OK: (C)60 le (W)70 le 79.72 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[16:25:56] <wikibugs>	 10Operations, 10Domains, 10Traffic: SOA serial numbers returned by authoritative nameservers differ - https://phabricator.wikimedia.org/T206688 (10Dzahn) Thank you for the detailed explanation. I will get back to Legal and MarkMonitor about it.
[16:26:03] <wikibugs>	 10Operations, 10Domains, 10Traffic: SOA serial numbers returned by authoritative nameservers differ - https://phabricator.wikimedia.org/T206688 (10Dzahn) a:03Dzahn
[16:26:17] <icinga-wm>	 RECOVERY - High lag on wdqs1003 is OK: (C)3600 ge (W)1200 ge 441 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[16:26:38] <icinga-wm>	 RECOVERY - Check systemd state on dns5001 is OK: OK - running: The system is fully operational
[16:27:48] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] logrotate: add type contraints to parameters [puppet] - 10https://gerrit.wikimedia.org/r/466687 (owner: 10Gehel)
[16:28:37] <icinga-wm>	 PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 109 probes of 319 (alerts on 35) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[16:29:17] <icinga-wm>	 PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 122 probes of 319 (alerts on 35) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[16:29:26] <wikibugs>	 (03PS51) 10Alex Monk: Central certificates service [puppet] - 10https://gerrit.wikimedia.org/r/441991 (https://phabricator.wikimedia.org/T194962)
[16:29:45] <wikibugs>	 (03PS2) 10Gehel: logrotate: add type contraints to parameters [puppet] - 10https://gerrit.wikimedia.org/r/466687
[16:30:34] <wikibugs>	 (03CR) 10Gehel: "puppet compiler on a few nodes agrees this is a NOOP: https://puppet-compiler.wmflabs.org/compiler1002/12874/" [puppet] - 10https://gerrit.wikimedia.org/r/466687 (owner: 10Gehel)
[16:30:36] <wikibugs>	 (03PS10) 10Alex Monk: Certcentral-authdns integration [puppet] - 10https://gerrit.wikimedia.org/r/459809
[16:31:14] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Certcentral-authdns integration [puppet] - 10https://gerrit.wikimedia.org/r/459809 (owner: 10Alex Monk)
[16:32:32] <wikibugs>	 (03CR) 10Herron: [C: 04-1] "Seems safe on paper, but I'm -1 because it carries risk of ssh and cumin lockout across many systems at a time leaving only serial access " [puppet] - 10https://gerrit.wikimedia.org/r/444230 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff)
[16:32:44] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] logrotate: add type contraints to parameters [puppet] - 10https://gerrit.wikimedia.org/r/466687 (owner: 10Gehel)
[16:32:47] <icinga-wm>	 PROBLEM - High lag on wdqs1003 is CRITICAL: 4397 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[16:33:38] <icinga-wm>	 RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 20 probes of 319 (alerts on 35) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[16:34:18] <icinga-wm>	 RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 18 probes of 319 (alerts on 35) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[16:38:48] <wikibugs>	 (03PS52) 10Alex Monk: Central certificates service [puppet] - 10https://gerrit.wikimedia.org/r/441991 (https://phabricator.wikimedia.org/T194962)
[16:38:50] <wikibugs>	 (03PS11) 10Alex Monk: Certcentral-authdns integration [puppet] - 10https://gerrit.wikimedia.org/r/459809
[16:38:52] <wikibugs>	 (03CR) 10Mathew.onipe: [C: 031] logrotate: add type contraints to parameters [puppet] - 10https://gerrit.wikimedia.org/r/466687 (owner: 10Gehel)
[16:39:38] <icinga-wm>	 RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 32 probes of 319 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[16:44:01] <wikibugs>	 (03PS3) 10Gehel: logrotate: add type contraints to parameters [puppet] - 10https://gerrit.wikimedia.org/r/466687
[16:44:52] <wikibugs>	 (03PS53) 10Vgutierrez: Central certificates service [puppet] - 10https://gerrit.wikimedia.org/r/441991 (https://phabricator.wikimedia.org/T194962) (owner: 10Alex Monk)
[16:46:35] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Central certificates service [puppet] - 10https://gerrit.wikimedia.org/r/441991 (https://phabricator.wikimedia.org/T194962) (owner: 10Alex Monk)
[16:46:57] <icinga-wm>	 PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 70 probes of 319 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[16:47:07] <paravoid>	 XioNoX, bblack ^^
[16:47:08] <paravoid>	 what's that?
[16:47:32] <paravoid>	 there's a bunch of them hitting the threshold
[16:47:57] <icinga-wm>	 RECOVERY - High lag on wdqs1003 is OK: (C)3600 ge (W)1200 ge 515 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[16:48:04] <bblack>	 good question!
[16:48:10] <XioNoX>	 yeah, and I increased the threshold in https://phabricator.wikimedia.org/T205829
[16:48:43] <bblack>	 either our ipv6 is awful, or the internet's is, or ripe's ipv6 probe set is awful, one of the three?
[16:48:46] <XioNoX>	 I had a quick look and can't find anything related to our network, or any common middleman
[16:48:49] <bblack>	 hopefully not the first option
[16:49:09] <wikibugs>	 (03PS54) 10Vgutierrez: Central certificates service [puppet] - 10https://gerrit.wikimedia.org/r/441991 (https://phabricator.wikimedia.org/T194962) (owner: 10Alex Monk)
[16:49:26] <XioNoX>	 maybe next step is to email the ripe atlas team?
[16:49:33] <wikibugs>	 (03PS1) 10Valerie: Add copyviobot group management to relevant wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466692 (https://phabricator.wikimedia.org/T206731)
[16:49:48] <bblack>	 have you checked the failing probes themselves?
[16:50:01] <bblack>	 maybe we can find out why for some subset of cases and see a pattern?
[16:50:29] <XioNoX>	 bblack: that's what I did for the 20ish failing on that comment https://phabricator.wikimedia.org/T205829#4652785
[16:51:29] <wikibugs>	 (03PS1) 10Gehel: rsyslog: replace deprecated validate_numeric() with type contraints [puppet] - 10https://gerrit.wikimedia.org/r/466693
[16:52:13] <XioNoX>	 I have a list of 53 now, from the most recent of https://atlas.ripe.net/measurements/1790947/#!probes
[16:53:17] <icinga-wm>	 PROBLEM - High lag on wdqs1003 is CRITICAL: 5349 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[16:53:37] <wikibugs>	 (03PS1) 10Kaldari: Add copyviobot group management to relevant wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466694 (https://phabricator.wikimedia.org/T206731)
[16:54:17] <gehel>	 !log depooling wdqs1003 to let it catch up on lag
[16:54:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:54:39] <wikibugs>	 (03Abandoned) 10Kaldari: Add copyviobot group management to relevant wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466692 (https://phabricator.wikimedia.org/T206731) (owner: 10Valerie)
[16:57:17] <XioNoX>	 also they are anchor-to-anchor probes, so they should be more reliable than regular probes
[16:57:28] <icinga-wm>	 RECOVERY - High lag on wdqs1003 is OK: (C)3600 ge (W)1200 ge 178 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[16:58:36] <wikibugs>	 (03PS55) 10Alex Monk: Central certificates service [puppet] - 10https://gerrit.wikimedia.org/r/441991 (https://phabricator.wikimedia.org/T194962)
[16:59:56] <XioNoX>	 here is the new traceroute measurement https://atlas.ripe.net/measurements/1790947/#!probes
[17:00:16] <XioNoX>	 er, https://atlas.ripe.net/measurements/16459053/#!general
[17:00:43] <wikibugs>	 (03PS1) 10Cwhite: prometheus: add collector.ntp.server option and enable on recursor nodes [puppet] - 10https://gerrit.wikimedia.org/r/466696 (https://phabricator.wikimedia.org/T183454)
[17:00:45] <wikibugs>	 (03CR) 10Kaldari: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466694 (https://phabricator.wikimedia.org/T206731) (owner: 10Kaldari)
[17:04:28] <bblack>	 taking a random example of the "unreachable" there
[17:04:31] <bblack>	 https://atlas.ripe.net/probes/6028/#!tab-builtins
[17:04:44] <bblack>	 ^ this one seems healthy on ipv4, but can't reach most (all?) of the DNS root servers over ipv6
[17:04:48] <bblack>	 doesn't seem like our problem
[17:05:08] <icinga-wm>	 PROBLEM - High lag on wdqs1003 is CRITICAL: 4427 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[17:05:16] <bblack>	 I've only checked one or two of them though
[17:05:55] <bblack>	 but this is another one, where they mostly succeed on ipv6: https://atlas.ripe.net/probes/6085/#!tab-builtins
[17:05:58] <bblack>	 (but fail with us)
[17:07:04] <bblack>	 this one is unreachable to us but succeeds with all the built-ins: https://atlas.ripe.net/probes/6116/#!tab-builtins
[17:07:33] <XioNoX>	 bblack: the 6085, I can reach it from bast1002, so it will probably recover on the next run
[17:08:09] <wikibugs>	 10Operations, 10Analytics, 10Analytics-Cluster, 10User-Elukey: Manage Hue via systemd unit - https://phabricator.wikimedia.org/T206484 (10fdans) p:05Triage>03Normal
[17:08:18] <icinga-wm>	 PROBLEM - Blazegraph Port on wdqs1009 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused
[17:08:24] <banyek|away>	 !log automated binlog purging started on pc2004, pc2005, pc2006
[17:08:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:08:28] <XioNoX>	 6116 works too from us, but goes through some congested HE network
[17:09:17] * gehel is looking at blazegraph on wdqs1009 (test server, not critical), cc onimisionipe 
[17:09:27] <icinga-wm>	 RECOVERY - Blazegraph Port on wdqs1009 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9999
[17:09:53] <wikibugs>	 (03PS3) 10Cwhite: nutcracker: ensure absent nutcracker.py [puppet] - 10https://gerrit.wikimedia.org/r/464917 (https://phabricator.wikimedia.org/T183454)
[17:10:23] <gehel>	 onimisionipe: looks like autodeploy worked just fine, but took slightly longer than expected to restart and was detected by icinga
[17:12:21] <elukey>	 moritzm: about nutcracker.py - it seems that we are still using it, maybe worth keeping it for a bit? I don't remember if we have a prometheus exporter for nutcracker (probably not)
[17:12:32] <wikibugs>	 10Operations, 10netops: Configure v6 OOB for ulsfo - https://phabricator.wikimedia.org/T206778 (10ayounsi) p:05Triage>03Low
[17:12:49] <elukey>	 Cc: shdubsh --^
[17:13:18] <icinga-wm>	 PROBLEM - WDQS HTTP Port on wdqs1009 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 380 bytes in 0.000 second response time
[17:13:22] <wikibugs>	 10Operations, 10ops-eqiad, 10Datacenter-Switchover-2018, 10Patch-For-Review: rack/setup/install mwmaint1002.eqiad.wmnet - https://phabricator.wikimedia.org/T201343 (10akosiaris) @Dzahn anything left here ?
[17:14:27] <icinga-wm>	 RECOVERY - WDQS HTTP Port on wdqs1009 is OK: HTTP OK: HTTP/1.1 200 OK - 434 bytes in 0.047 second response time
[17:14:31] <shdubsh>	 elukey: puppet has one here: modules/prometheus/manifests/nutcracker_exporter.pp  is it not in use?
[17:14:36] <wikibugs>	 10Operations, 10netops: Configure v6 OOB for ulsfo - https://phabricator.wikimedia.org/T206778 (10RobH)  Your case # 00532252: Wikimedia Foundation, Inc._Existing Customer_San Francisco has been updated with the following:        "IPv6 Network Information:  Network: 2607:fb58:9000:7::/64 Gateway: 2607:fb58:900...
[17:15:37] <wikibugs>	 (03PS56) 10Alex Monk: Central certificates service [puppet] - 10https://gerrit.wikimedia.org/r/441991 (https://phabricator.wikimedia.org/T194962)
[17:16:34] <elukey>	 yep! https://grafana.wikimedia.org/dashboard/db/nutcracker
[17:16:37] <elukey>	 never mind then
[17:16:40] <elukey>	 just wanted to make sure
[17:17:06] <shdubsh>	 thanks for double-checking :)
[17:17:19] <wikibugs>	 (03CR) 10Cwhite: [C: 032] nutcracker: ensure absent nutcracker.py [puppet] - 10https://gerrit.wikimedia.org/r/464917 (https://phabricator.wikimedia.org/T183454) (owner: 10Cwhite)
[17:18:02] <wikibugs>	 (03PS1) 10Gehel: base::service_unit: add type constraints on parameters [puppet] - 10https://gerrit.wikimedia.org/r/466697
[17:20:49] <wikibugs>	 (03PS1) 10Cwhite: nutcracker: remove diamond collector resource [puppet] - 10https://gerrit.wikimedia.org/r/466698 (https://phabricator.wikimedia.org/T183454)
[17:21:17] <icinga-wm>	 PROBLEM - Blazegraph Port on wdqs1009 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused
[17:22:20] <wikibugs>	 (03PS2) 10Cwhite: nutcracker: remove diamond collector resource [puppet] - 10https://gerrit.wikimedia.org/r/466698 (https://phabricator.wikimedia.org/T183454)
[17:22:58] <icinga-wm>	 ACKNOWLEDGEMENT - Blazegraph Port on wdqs1009 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused Mathew.onipe wdqs-autodeployment causing this - T197187
[17:23:29] <icinga-wm>	 RECOVERY - Blazegraph Port on wdqs1009 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9999
[17:26:18] <icinga-wm>	 PROBLEM - WDQS HTTP Port on wdqs1009 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 380 bytes in 0.001 second response time
[17:26:20] <wikibugs>	 (03PS1) 10Gehel: wdqs: run autodeploy on the hour, 4 times per day [puppet] - 10https://gerrit.wikimedia.org/r/466700 (https://phabricator.wikimedia.org/T206597)
[17:27:27] <icinga-wm>	 RECOVERY - WDQS HTTP Port on wdqs1009 is OK: HTTP OK: HTTP/1.1 200 OK - 434 bytes in 0.058 second response time
[17:28:24] <gehel>	 fix coming up for wdqs1009, again, test server not critical
[17:28:40] <wikibugs>	 (03CR) 10Mathew.onipe: [C: 031] wdqs: run autodeploy on the hour, 4 times per day [puppet] - 10https://gerrit.wikimedia.org/r/466700 (https://phabricator.wikimedia.org/T206597) (owner: 10Gehel)
[17:29:25] <wikibugs>	 (03CR) 10Gehel: [C: 032] wdqs: run autodeploy on the hour, 4 times per day [puppet] - 10https://gerrit.wikimedia.org/r/466700 (https://phabricator.wikimedia.org/T206597) (owner: 10Gehel)
[17:29:31] <wikibugs>	 (03CR) 10Herron: [C: 04-1] "TIL what mjolnir is!  For starters we will want some ferm rules to permit connections from the prometheus servers to the mjolnir metrics l" [puppet] - 10https://gerrit.wikimedia.org/r/454644 (owner: 10EBernhardson)
[17:31:09] <icinga-wm>	 RECOVERY - High lag on wdqs1003 is OK: (C)3600 ge (W)1200 ge 1188 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[17:36:57] <gehel>	 !log repooling wdqs1003, caught up on lag
[17:36:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:37:37] <wikibugs>	 (03CR) 10Volans: [C: 031] "LGTM, thanks for this. Just double verify with the compiler on a couple of hosts." [puppet] - 10https://gerrit.wikimedia.org/r/466697 (owner: 10Gehel)
[17:38:49] <wikibugs>	 (03CR) 10Volans: [C: 031] "LGTM, thanks for this. Just double verify with the compiler on a couple of hosts." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/466693 (owner: 10Gehel)
[17:41:20] <wikibugs>	 (03CR) 10Volans: [C: 031] "LGTM, thanks for this. Just double verify with the compiler on a couple of hosts." [puppet] - 10https://gerrit.wikimedia.org/r/466687 (owner: 10Gehel)
[17:43:48] <icinga-wm>	 PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on einsteinium is CRITICAL: 56.96 le 60 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[17:44:48] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) is CRITICAL: Test Retrieve all events for Jan 15 returned the unexpected status 504 (expecting: 200): /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) is WARNING: Test Retrieve aggregated feed content for April 29, 2016
[17:44:48] <icinga-wm>	 pected value at path = Missing keys: [umostread]
[17:45:57] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy
[17:48:07] <icinga-wm>	 RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on einsteinium is OK: (C)60 le (W)70 le 75.74 https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[18:00:04] <jouncebot>	 addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Time to snap out of that daydream and deploy Morning SWAT (Max 6 patches). Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181011T1800).
[18:00:04] <jouncebot>	 stephanebisson: A patch you scheduled for Morning SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[18:00:20] <stephanebisson>	 hello
[18:00:38] <stephanebisson>	 I'll run the SWAT
[18:01:22] <wikibugs>	 (03CR) 10Sbisson: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466694 (https://phabricator.wikimedia.org/T206731) (owner: 10Kaldari)
[18:02:47] <wikibugs>	 (03Merged) 10jenkins-bot: Add copyviobot group management to relevant wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466694 (https://phabricator.wikimedia.org/T206731) (owner: 10Kaldari)
[18:08:06] <wikibugs>	 (03PS1) 10Marostegui: db-eqiad,db-codfw.php: Repool db1101:3318, db2085:3318, db2083 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466710 (https://phabricator.wikimedia.org/T206743)
[18:09:11] <marostegui>	 stephanebisson: I need to deploy ^ for a high priority ticket, can I sneak in and deploy?
[18:09:22] <logmsgbot>	 !log sbisson@deploy1001 Synchronized wmf-config/CommonSettings.php: SWAT: [[gerrit:466694|Add copyviobot group management to relevant wikis]] (duration: 00m 49s)
[18:09:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:09:37] <stephanebisson>	 marostegui: yep, go ahead I'll continue after
[18:09:50] <marostegui>	 stephanebisson: thank you, it should take a minute
[18:09:53] <wikibugs>	 (03CR) 10Marostegui: [C: 032] db-eqiad,db-codfw.php: Repool db1101:3318, db2085:3318, db2083 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466710 (https://phabricator.wikimedia.org/T206743) (owner: 10Marostegui)
[18:12:09] <wikibugs>	 (03Merged) 10jenkins-bot: db-eqiad,db-codfw.php: Repool db1101:3318, db2085:3318, db2083 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466710 (https://phabricator.wikimedia.org/T206743) (owner: 10Marostegui)
[18:12:59] <wikibugs>	 (03PS2) 10Sbisson: Remove config for RCFilters variables being removed from Core [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463319 (https://phabricator.wikimedia.org/T196033)
[18:13:12] <logmsgbot>	 !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1101:3318 (duration: 00m 49s)
[18:13:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:13:41] <wikibugs>	 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10cloud-services-team: Provide a way to have test servers on real hardware, isolated from production - https://phabricator.wikimedia.org/T206636 (10Andrew) Hello @Gehel!  We're unlikely to support bare metal on Labs in the near future, largely because our...
[18:14:13] <logmsgbot>	 !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Repool db2083 and db2085:3318 (duration: 00m 48s)
[18:14:13] <wikibugs>	 (03CR) 10Catrope: [C: 031] Remove config for RCFilters variables being removed from Core [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463319 (https://phabricator.wikimedia.org/T196033) (owner: 10Sbisson)
[18:14:14] <marostegui>	 stephanebisson: I am all done, thanks a lot!
[18:14:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:14:49] <stephanebisson>	 marostegui: no worries, resuming SWAT
[18:15:27] <wikibugs>	 (03CR) 10Sbisson: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463319 (https://phabricator.wikimedia.org/T196033) (owner: 10Sbisson)
[18:15:42] <wikibugs>	 10Operations, 10DBA, 10MediaWiki-Database, 10Datacenter-Switchover-2018: parsercache used disk space increase - https://phabricator.wikimedia.org/T206740 (10akosiaris) >>! In T206740#4658660, @jcrespo wrote: > Forgetting the codfw -> eqiad replication was the most likely cause of overload on the applicatio...
[18:16:14] <wikibugs>	 (03CR) 10jenkins-bot: Add copyviobot group management to relevant wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466694 (https://phabricator.wikimedia.org/T206731) (owner: 10Kaldari)
[18:16:16] <wikibugs>	 (03CR) 10jenkins-bot: db-eqiad,db-codfw.php: Repool db1101:3318, db2085:3318, db2083 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466710 (https://phabricator.wikimedia.org/T206743) (owner: 10Marostegui)
[18:17:07] <wikibugs>	 (03CR) 10Herron: [C: 031] "The check/retry intervals and prometheus query time frame seem a bit long (in particular a bit of long time to wait for recovery) but from" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/465450 (https://phabricator.wikimedia.org/T206114) (owner: 10Mathew.onipe)
[18:17:14] <wikibugs>	 10Operations, 10DBA, 10MediaWiki-Database, 10Datacenter-Switchover-2018: parsercache used disk space increase - https://phabricator.wikimedia.org/T206740 (10Marostegui) >>! In T206740#4659202, @akosiaris wrote: >>>! In T206740#4658660, @jcrespo wrote: >> Forgetting the codfw -> eqiad replication was the mo...
[18:18:16] <wikibugs>	 (03Merged) 10jenkins-bot: Remove config for RCFilters variables being removed from Core [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463319 (https://phabricator.wikimedia.org/T196033) (owner: 10Sbisson)
[18:18:37] <wikibugs>	 10Operations, 10ops-eqiad, 10Datacenter-Switchover-2018, 10Patch-For-Review: rack/setup/install mwmaint1002.eqiad.wmnet - https://phabricator.wikimedia.org/T201343 (10Dzahn) Yes.  mariadb: remove mwmaint1001 from prod-m5 SQL grants - https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/465645/ network::c...
[18:21:19] <wikibugs>	 10Operations, 10ops-eqiad, 10Datacenter-Switchover-2018, 10Patch-For-Review: rack/setup/install mwmaint1002.eqiad.wmnet - https://phabricator.wikimedia.org/T201343 (10Marostegui) >>! In T201343#4659214, @Dzahn wrote: > Yes. >  > mariadb: remove mwmaint1001 from prod-m5 SQL grants - https://gerrit.wikimedia...
[18:22:08] <logmsgbot>	 !log sbisson@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:463319|Remove config for RCFilters variables being removed from Core]] (duration: 00m 49s)
[18:22:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:31:39] <wikibugs>	 (03CR) 10jenkins-bot: Remove config for RCFilters variables being removed from Core [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463319 (https://phabricator.wikimedia.org/T196033) (owner: 10Sbisson)
[18:32:24] <wikibugs>	 10Operations, 10ops-eqiad, 10Datacenter-Switchover-2018, 10Patch-For-Review: rack/setup/install mwmaint1002.eqiad.wmnet - https://phabricator.wikimedia.org/T201343 (10Dzahn) There is no urgency at all and it wasn't expected. I only listed what is left to be done. Please dont worry about this at all, especi...
[18:34:49] <wikibugs>	 10Operations, 10ops-eqiad, 10Datacenter-Switchover-2018, 10Patch-For-Review: rack/setup/install mwmaint1002.eqiad.wmnet - https://phabricator.wikimedia.org/T201343 (10Dzahn) p:05High>03Normal lowering priority because mwmaint1002 is in production and the remaining steps are all just cleanup
[18:34:57] <logmsgbot>	 !log sbisson@deploy1001 Synchronized php-1.32.0-wmf.24/extensions/PageTriage/modules/ext.pageTriage.views.list/ext.pageTriage.listControlNav.js: SWAT: [[gerrit:465677|Default to deleted and others when no type is selected on mode switch]] (duration: 00m 50s)
[18:34:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:39:06] <wikibugs>	 (03PS3) 10EBernhardson: Collect prometheus metrics from mjolnir [puppet] - 10https://gerrit.wikimedia.org/r/454644
[18:39:17] <wikibugs>	 (03CR) 10EBernhardson: "ferm rules added, and port changed to 9170/9171" [puppet] - 10https://gerrit.wikimedia.org/r/454644 (owner: 10EBernhardson)
[18:39:18] <icinga-wm>	 PROBLEM - High lag on wdqs1003 is CRITICAL: 3603 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[18:39:57] <wikibugs>	 10Operations, 10ops-eqiad, 10Datacenter-Switchover-2018, 10Patch-For-Review: rack/setup/install mwmaint1002.eqiad.wmnet - https://phabricator.wikimedia.org/T201343 (10Dzahn) 05Open>03Resolved Ok, let's keep the ticket within the original focus.. setting up mwmaint1002.  That is done.  Normally there wo...
[18:40:16] <wikibugs>	 10Operations, 10ops-eqiad, 10Datacenter-Switchover-2018: rack/setup/install mwmaint1002.eqiad.wmnet - https://phabricator.wikimedia.org/T201343 (10Dzahn)
[18:46:16] <wikibugs>	 10Operations, 10ops-eqiad, 10Cloud-VPS, 10Patch-For-Review, 10cloud-services-team (Kanban): rack/setup/install cloudvirt102[34] - https://phabricator.wikimedia.org/T199125 (10RobH) a:05MoritzMuehlenhoff>03RobH
[18:46:22] <robh>	 andrewbogott: Heyas, you about?
[18:46:22] <logmsgbot>	 !log sbisson@deploy1001 Synchronized php-1.32.0-wmf.24/extensions/PageTriage/: SWAT: [[gerrit:465676|Handle page that are unnominated for deletion]] (duration: 00m 50s)
[18:46:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:47:35] <wikibugs>	 10Operations, 10Performance-Team, 10Traffic, 10Wikimedia-General-or-Unknown, and 2 others: Search engines continue to link to JS-redirect destination after Wikipedia copyright protest - https://phabricator.wikimedia.org/T199252 (10Imarlier) 05Open>03Resolved
[18:47:36] <stephanebisson>	 and that concludes the SWAT
[19:01:46] <wikibugs>	 (03PS3) 10Dzahn: Revert "mw_maintenance: temp hack to avoid duplicate crons on switch to eqiad" [puppet] - 10https://gerrit.wikimedia.org/r/465645
[19:01:51] <wikibugs>	 (03CR) 10Dzahn: [C: 031] "Compilation results for mwmaint2001.codfw.wmnet: no change" [puppet] - 10https://gerrit.wikimedia.org/r/465645 (owner: 10Dzahn)
[19:03:55] <wikibugs>	 (03CR) 10Dzahn: [C: 032] Revert "mw_maintenance: temp hack to avoid duplicate crons on switch to eqiad" [puppet] - 10https://gerrit.wikimedia.org/r/465645 (owner: 10Dzahn)
[19:10:01] <wikibugs>	 (03PS1) 10Ladsgroup: ores: Add logstash config [puppet] - 10https://gerrit.wikimedia.org/r/466716 (https://phabricator.wikimedia.org/T181546)
[19:10:48] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] ores: Add logstash config [puppet] - 10https://gerrit.wikimedia.org/r/466716 (https://phabricator.wikimedia.org/T181546) (owner: 10Ladsgroup)
[19:14:47] <wikibugs>	 (03PS2) 10Ladsgroup: ores: Add logstash config [puppet] - 10https://gerrit.wikimedia.org/r/466716 (https://phabricator.wikimedia.org/T181546)
[19:15:40] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] ores: Add logstash config [puppet] - 10https://gerrit.wikimedia.org/r/466716 (https://phabricator.wikimedia.org/T181546) (owner: 10Ladsgroup)
[19:19:56] <wikibugs>	 (03PS3) 10Ladsgroup: ores: Add logstash config [puppet] - 10https://gerrit.wikimedia.org/r/466716 (https://phabricator.wikimedia.org/T181546)
[19:20:25] <wikibugs>	 (03CR) 10Dzahn: [C: 032] "noop and double-checked:" [puppet] - 10https://gerrit.wikimedia.org/r/465645 (owner: 10Dzahn)
[19:28:24] <wikibugs>	 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10cloud-services-team: Provide a way to have test servers on real hardware, isolated from production - https://phabricator.wikimedia.org/T206636 (10Gehel) The main contention point for WDQS (or investigating alternatives) seems to be IOPS. We tried settin...
[19:29:18] <icinga-wm>	 RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 34 probes of 319 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[19:35:24] <wikibugs>	 (03PS1) 10RobH: cloudvirt1023 to attempt to use jessie [puppet] - 10https://gerrit.wikimedia.org/r/466721
[19:35:59] <wikibugs>	 (03CR) 10RobH: [C: 032] cloudvirt1023 to attempt to use jessie [puppet] - 10https://gerrit.wikimedia.org/r/466721 (owner: 10RobH)
[19:36:21] <icinga-wm>	 PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 44 probes of 319 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[19:38:32] <wikibugs>	 (03PS1) 10Gehel: wdqs: use recent change poller on public cluster instead of kafka [puppet] - 10https://gerrit.wikimedia.org/r/466722 (https://phabricator.wikimedia.org/T206423)
[19:48:32] <icinga-wm>	 PROBLEM - High lag on wdqs1003 is CRITICAL: 6999 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[19:48:52] <wikibugs>	 (03CR) 10Gehel: "> Patch Set 4: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/465450 (https://phabricator.wikimedia.org/T206114) (owner: 10Mathew.onipe)
[19:50:40] <wikibugs>	 10Operations, 10monitoring, 10Discovery-Search (Current work), 10Patch-For-Review: Create an Icinga check to alert on packet dropped - https://phabricator.wikimedia.org/T206114 (10Gehel) General question on how to deploy this kind of change:  This will most probably trip on a number of nodes (I know that a...
[19:50:42] <icinga-wm>	 PROBLEM - High lag on wdqs1003 is CRITICAL: 7103 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[19:51:31] <icinga-wm>	 RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 26 probes of 319 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[19:54:48] <wikibugs>	 10Operations, 10monitoring, 10Discovery-Search (Current work), 10Patch-For-Review: Create an Icinga check to alert on packet dropped - https://phabricator.wikimedia.org/T206114 (10Volans) >>! In T206114#4659596, @Gehel wrote: > How do we ensure that a check like this does not generate too much noise when w...
[19:55:11] <icinga-wm>	 PROBLEM - High lag on wdqs1003 is CRITICAL: 7341 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[19:55:36] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: uwsgi: Remove the uwsgi-dbg package [puppet] - 10https://gerrit.wikimedia.org/r/466723
[19:55:41] <wikibugs>	 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Wikidata-Query-Service-Sprint: Define an SLO for Wikidata Query Service public endpoint and communicate it - https://phabricator.wikimedia.org/T199228 (10Gehel) >>! In T199228#4655321, @Smalyshev wrote: > I think update lag is not the biggest...
[19:58:31] <icinga-wm>	 PROBLEM - High lag on wdqs1003 is CRITICAL: 7491 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[19:58:51] <icinga-wm>	 PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 58 probes of 319 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[19:59:08] <wikibugs>	 (03PS3) 10Mforns: Add druid_load jobs to analytics refinery [puppet] - 10https://gerrit.wikimedia.org/r/465692 (https://phabricator.wikimedia.org/T206342)
[20:01:30] <wikibugs>	 (03CR) 10Legoktm: [C: 031] "Seems fine, I didn't review all of the bash logic. I'm not sure whether it's worth maintaining all of that logic vs just hardcoding php7.0" [puppet] - 10https://gerrit.wikimedia.org/r/462748 (https://phabricator.wikimedia.org/T205313) (owner: 10Thcipriani)
[20:04:43] <wikibugs>	 (03CR) 10Gehel: "puppet compiler looks good: https://puppet-compiler.wmflabs.org/compiler1002/12877/" [puppet] - 10https://gerrit.wikimedia.org/r/466722 (https://phabricator.wikimedia.org/T206423) (owner: 10Gehel)
[20:04:51] <wikibugs>	 (03PS2) 10Gehel: wdqs: use recent change poller on public cluster instead of kafka [puppet] - 10https://gerrit.wikimedia.org/r/466722 (https://phabricator.wikimedia.org/T206423)
[20:05:17] <wikibugs>	 10Operations, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team (Kanban): rack/setup/install cloudvirt102[34] - https://phabricator.wikimedia.org/T199125 (10RobH) So, cloudvirt1023 is now installed and has puppet signed and running with jessie.
[20:05:35] <wikibugs>	 10Operations, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team (Kanban): rack/setup/install cloudvirt102[34] - https://phabricator.wikimedia.org/T199125 (10RobH)
[20:06:43] <icinga-wm>	 PROBLEM - High lag on wdqs1003 is CRITICAL: 7819 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[20:07:39] <wikibugs>	 (03CR) 10Mathew.onipe: [C: 031] wdqs: use recent change poller on public cluster instead of kafka [puppet] - 10https://gerrit.wikimedia.org/r/466722 (https://phabricator.wikimedia.org/T206423) (owner: 10Gehel)
[20:07:54] <wikibugs>	 (03CR) 10Gehel: [C: 032] wdqs: use recent change poller on public cluster instead of kafka [puppet] - 10https://gerrit.wikimedia.org/r/466722 (https://phabricator.wikimedia.org/T206423) (owner: 10Gehel)
[20:10:14] <wikibugs>	 (03PS5) 10Mathew.onipe: base::monitoring::host: added icinga prometheus check for network drops [puppet] - 10https://gerrit.wikimedia.org/r/465450 (https://phabricator.wikimedia.org/T206114)
[20:16:53] <icinga-wm>	 PROBLEM - puppet last run on lvs5003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[20:20:20] <wikibugs>	 10Operations, 10Traffic, 10Patch-For-Review: Renew unified certificates 2017 - https://phabricator.wikimedia.org/T178173 (10BBlack) 05Open>03Resolved a:03BBlack Yes, these certs are long-deployed :)
[20:22:01] <wikibugs>	 10Operations, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team (Kanban): rack/setup/install cloudvirt102[34] - https://phabricator.wikimedia.org/T199125 (10RobH) a:05RobH>03Cmjohnson Chris,  Please install the replacement h730P when it arrives this Friday into cloudvirt1024, then assign this task back t...
[20:26:11] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 031] "Will merge it tomorrow EU morning." [puppet] - 10https://gerrit.wikimedia.org/r/466716 (https://phabricator.wikimedia.org/T181546) (owner: 10Ladsgroup)
[20:26:23] <wikibugs>	 (03CR) 10Mathew.onipe: "> Patch Set 4: Code-Review+1" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/465450 (https://phabricator.wikimedia.org/T206114) (owner: 10Mathew.onipe)
[20:27:09] <wikibugs>	 (03Abandoned) 10Dzahn: Revert "rename wmf6936 from mw1297 to mwmaint1001" [dns] - 10https://gerrit.wikimedia.org/r/465689 (https://phabricator.wikimedia.org/T192457) (owner: 10Dzahn)
[20:27:40] <wikibugs>	 (03Restored) 10Dzahn: Revert "rename wmf6936 from mw1297 to mwmaint1001" [dns] - 10https://gerrit.wikimedia.org/r/465689 (https://phabricator.wikimedia.org/T192457) (owner: 10Dzahn)
[20:29:01] <wikibugs>	 (03PS3) 10Dzahn: Revert "rename wmf6936 from mw1297 to mwmaint1001" [dns] - 10https://gerrit.wikimedia.org/r/465689 (https://phabricator.wikimedia.org/T192457)
[20:29:04] <wikibugs>	 (03CR) 10Mforns: "After chat with Andrew, this seems ready. However, still waiting to deploy this change first: https://gerrit.wikimedia.org/r/#/c/analytics" [puppet] - 10https://gerrit.wikimedia.org/r/465692 (https://phabricator.wikimedia.org/T206342) (owner: 10Mforns)
[20:29:04] <icinga-wm>	 RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 33 probes of 319 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[20:29:56] <wikibugs>	 (03PS4) 10Dzahn: Revert "rename wmf6936 from mw1297 to mwmaint1001" [dns] - 10https://gerrit.wikimedia.org/r/465689 (https://phabricator.wikimedia.org/T192457)
[20:35:32] <wikibugs>	 (03PS1) 10Dzahn: mediawiki_maintenance: switch home rsync to 1002->2001 [puppet] - 10https://gerrit.wikimedia.org/r/466731 (https://phabricator.wikimedia.org/T201343)
[20:36:10] <icinga-wm>	 PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 52 probes of 319 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[20:37:02] <XioNoX>	 !log add IPv6 to mr1-ulsfo OOB - T206778
[20:37:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:37:05] <stashbot>	 T206778: Configure v6 OOB for ulsfo - https://phabricator.wikimedia.org/T206778
[20:40:32] <wikibugs>	 (03PS2) 10MarcoAurelio: Disable CongressLookup everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/462173 (https://phabricator.wikimedia.org/T205049)
[20:42:00] <icinga-wm>	 RECOVERY - puppet last run on lvs5003 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures
[20:42:18] <wikibugs>	 10Operations, 10Traffic: Renew GlobalSign Unified in 2018 - https://phabricator.wikimedia.org/T206804 (10BBlack) p:05Triage>03Normal
[20:45:43] <wikibugs>	 (03PS1) 10Dzahn: re-add mw1297 to site.pp and DHCP, formerly mwmaint1001 [puppet] - 10https://gerrit.wikimedia.org/r/466773 (https://phabricator.wikimedia.org/T192185)
[20:46:11] <wikibugs>	 (03PS1) 10MarcoAurelio: Enable FileExporter to Meta-Wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466774 (https://phabricator.wikimedia.org/T205201)
[20:47:39] <wikibugs>	 (03CR) 10Muehlenhoff: re-add mw1297 to site.pp and DHCP, formerly mwmaint1001 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/466773 (https://phabricator.wikimedia.org/T192185) (owner: 10Dzahn)
[20:49:23] <wikibugs>	 (03PS2) 10MarcoAurelio: Enable FileExporter to Meta-Wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466774 (https://phabricator.wikimedia.org/T205201)
[20:51:11] <icinga-wm>	 RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 26 probes of 319 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[20:51:11] <wikibugs>	 (03CR) 10Dzahn: "and also removing mwmaint1001 from site in the same patch" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/466773 (https://phabricator.wikimedia.org/T192185) (owner: 10Dzahn)
[20:52:50] <wikibugs>	 (03PS2) 10Dzahn: re-add mw1297 to site.pp and DHCP, remove mwmaint1001 [puppet] - 10https://gerrit.wikimedia.org/r/466773 (https://phabricator.wikimedia.org/T192185)
[20:53:17] <wikibugs>	 (03PS1) 10MarcoAurelio: Enable ShortURL on gom.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466778 (https://phabricator.wikimedia.org/T206741)
[20:55:15] <wikibugs>	 (03CR) 10Dzahn: "so... schedule downtime, shut down, change DNS.. wait 1H .., change DHCP and site, run reimage script? sounds right?" [puppet] - 10https://gerrit.wikimedia.org/r/466773 (https://phabricator.wikimedia.org/T192185) (owner: 10Dzahn)
[20:55:37] <wikibugs>	 (03PS2) 10MarcoAurelio: Enable ShortURL on gom.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466778 (https://phabricator.wikimedia.org/T206741)
[20:56:00] <wikibugs>	 (03PS3) 10MarcoAurelio: Enable ShortURL on gom.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466778 (https://phabricator.wikimedia.org/T206741)
[20:56:30] <wikibugs>	 (03PS7) 10Herron: smarthost: create mail smarthost role/profile [puppet] - 10https://gerrit.wikimedia.org/r/463143 (https://phabricator.wikimedia.org/T41785)
[20:56:59] <wikibugs>	 (03CR) 10Herron: smarthost: create mail smarthost role/profile (039 comments) [puppet] - 10https://gerrit.wikimedia.org/r/463143 (https://phabricator.wikimedia.org/T41785) (owner: 10Herron)
[20:57:20] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 031] re-add mw1297 to site.pp and DHCP, remove mwmaint1001 [puppet] - 10https://gerrit.wikimedia.org/r/466773 (https://phabricator.wikimedia.org/T192185) (owner: 10Dzahn)
[20:58:05] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] smarthost: create mail smarthost role/profile [puppet] - 10https://gerrit.wikimedia.org/r/463143 (https://phabricator.wikimedia.org/T41785) (owner: 10Herron)
[20:58:31] <icinga-wm>	 PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 57 probes of 319 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[20:59:47] <Hauskatze>	 jouncebot: refresh
[20:59:48] <jouncebot>	 I refreshed my knowledge about deployments.
[20:59:51] <Hauskatze>	 jouncebot: next
[20:59:51] <jouncebot>	 In 2 hour(s) and 0 minute(s): Evening SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181011T2300)
[21:11:52] <wikibugs>	 (03PS1) 10Ayounsi: Add v6 OOB IP for mr1-ulsfo [dns] - 10https://gerrit.wikimedia.org/r/466783 (https://phabricator.wikimedia.org/T206778)
[21:12:53] <pmiazga>	 hey, I have a quick question and honestly I have no idea where to ask. I think that some time ago I heard that the web crawlers (like google search) are logged in when they crawl wikipedias. Is it true or did I mishear something?
[21:17:16] <Reedy>	 orly
[21:18:23] <c>	 why would they need to be logged-in? it wouldn't really matter either way?
[21:19:27] <bblack>	 in general, crawlers aren't logged-in, no.  at least not the big ones we observe
[21:20:43] <Reedy>	 pmiazga: Some of them don't even have useful user agents
[21:21:10] <wikibugs>	 (03PS1) 10Ayounsi: Icinga, add mr1-ulsfo IPv6 OOB [puppet] - 10https://gerrit.wikimedia.org/r/466787 (https://phabricator.wikimedia.org/T206778)
[21:21:32] <Platonides>	 I think it's the first time I hear that
[21:21:40] <Platonides>	 they are probably *identified*
[21:21:44] <pmiazga>	 ok, so I must have misheard something
[21:21:56] <Platonides>	 as in using a proper User-Agent and from their company ip addresses
[21:22:04] <wikibugs>	 10Operations, 10Wikimedia-Mailing-lists, 10User-Urbanecm: Non-working archive for wikimediacz-l list - https://phabricator.wikimedia.org/T205380 (10Urbanecm) Started to work magically...
[21:22:08] <Platonides>	 so we could know they are them
[21:22:20] <Platonides>	 maybe that was the source of the confusion
[21:22:22] <pmiazga>	 np, nah, we're working on a feature that is related to the search engines (we're adding the json-ld schema definition)
[21:22:50] <pmiazga>	 and I heard that we can identify the search engines, that was most probably it. I don't know why I thought they are logged in. thanks for quick answer guys!
[21:23:51] <icinga-wm>	 RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 30 probes of 319 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[21:27:59] <wikibugs>	 (03CR) 10Dzahn: [C: 032] re-add mw1297 to site.pp and DHCP, remove mwmaint1001 [puppet] - 10https://gerrit.wikimedia.org/r/466773 (https://phabricator.wikimedia.org/T192185) (owner: 10Dzahn)
[21:30:49] <wikibugs>	 (03CR) 10Dzahn: [C: 032] mediawiki_maintenance: switch home rsync to 1002->2001 [puppet] - 10https://gerrit.wikimedia.org/r/466731 (https://phabricator.wikimedia.org/T201343) (owner: 10Dzahn)
[21:31:10] <icinga-wm>	 PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 47 probes of 319 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[21:41:29] <mutante>	 !log mwmaint2001 - deleting 60G of unneeded files from home
[21:41:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:42:48] <Hauskatze>	 wow 60G
[21:44:05] <mutante>	 Hauskatze: server side uploads of large videos to commons i think , heh
[21:44:18] <mutante>	 when they are too large for normal upload
[21:44:32] <Hauskatze>	 yep, I know the process :)
[21:44:42] <Hauskatze>	 still limited to 5G even server-side, right?
[21:45:01] <Hauskatze>	 [[Uploading large files]] was the wikitech doc iirc
[21:45:25] <mutante>	 i am not sure what the current limit is.. *nod*
[21:45:32] <Hauskatze>	 "MediaWiki currently doesn't support files greater than 4 GB (as size is stored as a 32 bits unsigned integer) while our swift backend storage is limited to 5 Gb. See phab:T191804 and phab:T191802 for discussion to extend this limit respectively to 5 GB and beyond."
[21:45:32] <stashbot>	 T191804: Allow to store files between 4 and 5 Gb - https://phabricator.wikimedia.org/T191804
[21:45:33] <stashbot>	 T191802: [Epic] Determine a strategy to store files between 5 and 100 Gb - https://phabricator.wikimedia.org/T191802
[21:45:52] <Hauskatze>	 whether that is true or not, I cannot tell
[21:46:01] <Hauskatze>	 cfr. https://wikitech.wikimedia.org/wiki/Uploading_large_files
[21:47:28] <mutante>	 !log mwmaint2001 - rsyncing home dirs from mwmaint1002 to /root/home-mwmaint1002 (which includes home-terbium even!) in case anyone is missing anything from one of mwmaint*
[21:47:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:48:02] <mutante>	 these were Astronomy things in HD, B.rion used to deal with the huge files
[21:48:21] <mutante>	 i had confirmed they were not needed anymore
[21:48:32] <mutante>	 slashes the size in half or so
[21:49:59] <wikibugs>	 (03PS4) 10MarcoAurelio: Close chairwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/443585 (https://phabricator.wikimedia.org/T184961)
[21:53:57] <mutante>	 !log mwmaint1001 - scheduled downtime, is being renamed back to mw1297 and reinstalled
[21:53:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:57:19] <mutante>	 Hauskatze: one of them is "Thermonuclear Art in Ultra-HD", The Sun , heh   https://commons.wikimedia.org/wiki/File:NASA_Thermonuclear_Art_%E2%80%93_The_Sun_In_Ultra-HD_(4K)_(1080p).webm
[21:58:23] <Hauskatze>	 mutante: so that's what'd happen if we nuked a country... right :P
[21:59:04] <mutante>	 https://svs.gsfc.nasa.gov/12034
[22:01:00] <Hauskatze>	 mutante: some day we will be able to store those locally w/o the need to reconvert or resize them :)
[22:02:21] <mutante>	 Hauskatze: on IPFS?  
[22:02:43] <mutante>	 would be fitting especially for Astronomy
[22:04:11] <Hauskatze>	 mutante: I mean, on commons
[22:04:29] <Hauskatze>	 jouncebot: next
[22:04:30] <jouncebot>	 In 0 hour(s) and 55 minute(s): Evening SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181011T2300)
[22:04:44] <Hauskatze>	 sigh
[22:04:47] <Hauskatze>	 still an hour
[22:04:53] <mutante>	 Hauskatze: ah, yes :)
[22:04:54] * Hauskatze falls asleep
[22:06:48] <mutante>	 rsyncs from 2 mw servers and needs de-duplication
[22:17:31] <mutante>	 gzipping some large files and finding more large files worth asking about
[22:20:12] <Hauskatze>	 mutante: we're back at eqiad right?
[22:21:07] <mutante>	 Hauskatze: yes
[22:21:38] <Hauskatze>	 mutante: okay, it's because a patch I have for swat needs a maintenance script run to add some tables
[22:21:55] <mutante>	 Hauskatze: the right server will be mmaint1002
[22:22:01] <mutante>	 mwmaint1002
[22:22:25] <mutante>	 that being said i dont know if "add some tables" is ok 
[22:22:49] <Hauskatze>	 it's a swattable change, I've done it before
[22:22:54] <Hauskatze>	 for ShortUrl
[22:23:41] <mutante>	 is it a schema change? there is a special process for those
[22:25:23] <Hauskatze>	 mutante: no, no schema change
[22:25:33] <mutante>	 ok, good
[22:25:45] <Hauskatze>	 https://phabricator.wikimedia.org/diffusion/EWMA/browse/master/createExtensionTables.php$117
[22:29:46] <wikibugs>	 (03CR) 10MarcoAurelio: "Requires:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466778 (https://phabricator.wikimedia.org/T206741) (owner: 10MarcoAurelio)
[22:30:02] <mutante>	 !log mwmaint1001 - shutting down after final backup of /home, renaming back to mw1297 in DNS and DHCP, and reinstalling  (T192457)
[22:30:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:30:08] <stashbot>	 T192457: Reallocate former image scalers - https://phabricator.wikimedia.org/T192457
[22:41:45] * Krinkle staging on mwdebug1001
[22:42:36] <wikibugs>	 (03PS5) 10Dzahn: Revert "rename wmf6936 from mw1297 to mwmaint1001" [dns] - 10https://gerrit.wikimedia.org/r/465689 (https://phabricator.wikimedia.org/T192457)
[22:45:14] <logmsgbot>	 !log krinkle@deploy1001 Synchronized php-1.32.0-wmf.24/includes/Revision/RenderedRevision.php: I553dba13486 (duration: 00m 51s)
[22:45:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:47:03] <wikibugs>	 (03CR) 10Dzahn: [C: 032] Revert "rename wmf6936 from mw1297 to mwmaint1001" [dns] - 10https://gerrit.wikimedia.org/r/465689 (https://phabricator.wikimedia.org/T192457) (owner: 10Dzahn)
[22:48:40] <wikibugs>	 (03CR) 10Dzahn: [C: 032] "gone from Icinga and shut down" [dns] - 10https://gerrit.wikimedia.org/r/465689 (https://phabricator.wikimedia.org/T192457) (owner: 10Dzahn)
[22:50:35] <mutante>	 !log netbox - renamed mwmaint1001 to mw1297, changed status to inventory, renamed in DNS -  T192457
[22:50:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:50:41] <stashbot>	 T192457: Reallocate former image scalers - https://phabricator.wikimedia.org/T192457
[22:53:17] <mutante>	 !log netbox - correction, mwmaint1001 to status "Staged", following new lifecycle docs T192457
[22:53:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:54:27] <wikibugs>	 (03CR) 10MarcoAurelio: "Also:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466778 (https://phabricator.wikimedia.org/T206741) (owner: 10MarcoAurelio)
[22:56:18] <logmsgbot>	 !log dzahn@neodymium conftool action : set/pooled=inactive; selector: name=mwmaint1001.eqiad.wmnet
[22:56:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:00:04] <jouncebot>	 addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Dear deployers, time to do the Evening SWAT (Max 6 patches) deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181011T2300).
[23:00:05] <jouncebot>	 Hauskatze: A patch you scheduled for Evening SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[23:00:11] <Hauskatze>	 Yes I'm here.
[23:01:08] <Krinkle>	 Sorry, I'm rolling out a backport still. 1/2 is out already, the other is stuck in Jenkins for 27 minutes.
[23:01:33] <Hauskatze>	 :|
[23:01:35] <Krinkle>	 33 min*
[23:01:38] <Hauskatze>	 jenkins...
[23:01:45] <rxy>	 lol
[23:01:57] <Hauskatze>	 I want to go to bed :(
[23:02:59] <Hauskatze>	 Krinkle: how much do you estimate it'll take to finish?
[23:03:18] <Krinkle>	 I don't know. I thought 20min was long.
[23:03:39] <Krinkle>	 Is yours mw or config?
[23:04:06] <Krinkle>	 k, done, rolling out now
[23:04:31] <Hauskatze>	 Krinkle: mediawiki-config
[23:04:36] <rxy>	 Scheduling a SWAT right before you go to bed or before another important appointment is not a good idea, because sometimes SWAT is cancelled or runs into trouble.
[23:05:16] <Hauskatze>	 rxy: I can wait, but in any case this is the only time in the day I can attend a window
[23:05:24] <Hauskatze>	 so it's this or nothing
[23:05:40] <logmsgbot>	 !log krinkle@deploy1001 Synchronized php-1.32.0-wmf.24/includes/jobqueue/jobs/ThumbnailRenderJob.php: T203135 - Ib4640eb13ca93f (duration: 00m 49s)
[23:05:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:05:43] <stashbot>	 T203135: ThumbnailRender job fails with 429 errors - https://phabricator.wikimedia.org/T203135
[23:06:21] * Krinkle is done
[23:06:30] <Krinkle>	 Whoever takes swat today, go ahead :)
[23:06:47] <Hauskatze>	  --- and then the silence was made ---
[23:07:03] <Hauskatze>	 :)
[23:07:42] <Krinkle>	 most recently active were RoanKattouw and dereckson. If no ping in 10min, I can do it as well. I'm just taking a break for a few minutes first.
[23:08:47] <Hauskatze>	 Sure, take care
[23:09:00] <rxy>	 take care :)
[23:10:13] <wikibugs>	 10Operations, 10Patch-For-Review: Reallocate former image scalers - https://phabricator.wikimedia.org/T192457 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on neodymium.eqiad.wmnet for hosts: ``` ['mw1297.eqiad.wmnet'] ``` The log can be found in `/var/log/wmf-auto-reimage/201810112309_d...
[23:14:07] <wikibugs>	 10Operations, 10monitoring, 10Discovery-Search (Current work), 10Patch-For-Review: Create an Icinga check to alert on packet dropped - https://phabricator.wikimedia.org/T206114 (10faidon) How do these packet losses manifest? Are we talking about packets being lost in flight, error counters in interfaces, o...
[23:17:51] <wikibugs>	 10Operations, 10DBA, 10JADE, 10MW-1.32-notes (WMF-deploy-2018-09-25 (1.32.0-wmf.23)), and 3 others: Write our anticipated "phase two" schemas and submit for review - https://phabricator.wikimedia.org/T202596 (10awight) @Marostegui These are the proposed indexes, if you want to discuss something concrete: h...
[23:17:58] <wikibugs>	 (03CR) 10Reedy: [C: 032] Disable CongressLookup everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/462173 (https://phabricator.wikimedia.org/T205049) (owner: 10MarcoAurelio)
[23:18:35] <wikibugs>	 10Operations, 10Patch-For-Review: Reallocate former image scalers - https://phabricator.wikimedia.org/T192457 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on neodymium.eqiad.wmnet for hosts: ``` ['mw1297.eqiad.wmnet'] ``` The log can be found in `/var/log/wmf-auto-reimage/201810112318_d...
[23:20:24] <wikibugs>	 (03Merged) 10jenkins-bot: Disable CongressLookup everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/462173 (https://phabricator.wikimedia.org/T205049) (owner: 10MarcoAurelio)
[23:21:14] <wikibugs>	 (03CR) 10Faidon Liambotis: [C: 04-1] smarthost: create mail smarthost role/profile (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/463143 (https://phabricator.wikimedia.org/T41785) (owner: 10Herron)
[23:21:25] <wikibugs>	 (03PS3) 10Reedy: Enable FileExporter to Meta-Wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466774 (https://phabricator.wikimedia.org/T205201) (owner: 10MarcoAurelio)
[23:21:54] <logmsgbot>	 !log reedy@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Disable CongressLookup (duration: 00m 49s)
[23:21:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:22:04] <wikibugs>	 (03CR) 10Reedy: [C: 032] Enable FileExporter to Meta-Wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466774 (https://phabricator.wikimedia.org/T205201) (owner: 10MarcoAurelio)
[23:24:58] <wikibugs>	 (03Merged) 10jenkins-bot: Enable FileExporter to Meta-Wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466774 (https://phabricator.wikimedia.org/T205201) (owner: 10MarcoAurelio)
[23:25:23] <Krinkle>	 Reedy: thx
[23:26:56] <logmsgbot>	 !log reedy@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Enable FileExporter to Meta-Wiki (duration: 00m 49s)
[23:26:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:27:05] <icinga-wm>	 RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 33 probes of 319 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[23:27:52] <wikibugs>	 (03PS4) 10Reedy: Enable ShortURL on gom.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466778 (https://phabricator.wikimedia.org/T206741) (owner: 10MarcoAurelio)
[23:27:54] <wikibugs>	 (03CR) 10Faidon Liambotis: [C: 04-1] smarthost: create mail smarthost role/profile (036 comments) [puppet] - 10https://gerrit.wikimedia.org/r/463143 (https://phabricator.wikimedia.org/T41785) (owner: 10Herron)
[23:28:58] <wikibugs>	 (03CR) 10Reedy: [C: 032] Enable ShortURL on gom.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466778 (https://phabricator.wikimedia.org/T206741) (owner: 10MarcoAurelio)
[23:30:17] <Reedy>	 !log created shorturl table on gomwiki T206741
[23:30:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:30:20] <stashbot>	 T206741: Enable extension ShortURL for the Konkani Wikipedia - https://phabricator.wikimedia.org/T206741
[23:31:12] <wikibugs>	 (03Merged) 10jenkins-bot: Enable ShortURL on gom.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466778 (https://phabricator.wikimedia.org/T206741) (owner: 10MarcoAurelio)
[23:32:23] <logmsgbot>	 !log reedy@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Enable shorturl on gomwiki (duration: 00m 48s)
[23:32:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:33:20] <Reedy>	 !log ran mwscript extensions/ShortUrl/populateShortUrlTable.php --wiki=gomwiki T206741
[23:33:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:33:33] <Hauskatze>	 all set Reedy ?
[23:33:36] <wikibugs>	 (03CR) 10jenkins-bot: Disable CongressLookup everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/462173 (https://phabricator.wikimedia.org/T205049) (owner: 10MarcoAurelio)
[23:33:38] <wikibugs>	 (03CR) 10jenkins-bot: Enable FileExporter to Meta-Wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466774 (https://phabricator.wikimedia.org/T205201) (owner: 10MarcoAurelio)
[23:33:39] <Reedy>	 Yeah, all done
[23:33:40] <wikibugs>	 (03CR) 10jenkins-bot: Enable ShortURL on gom.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/466778 (https://phabricator.wikimedia.org/T206741) (owner: 10MarcoAurelio)
[23:33:47] <Hauskatze>	 Reedy: thanks a lot :)
[23:34:15] <icinga-wm>	 PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 45 probes of 319 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[23:51:39] <wikibugs>	 10Operations, 10Patch-For-Review: Reallocate former image scalers - https://phabricator.wikimedia.org/T192457 (10Dzahn) mw1297: done, renamed in DNS/DHCP, reinstalled, in Icinga again, renamed in netbox, changed netbox status to "Staged" per new lifecycle docs  https://icinga.wikimedia.org/cgi-bin/icinga/statu...
[23:52:25] <wikibugs>	 10Operations, 10Patch-For-Review: Reallocate former image scalers - https://phabricator.wikimedia.org/T192457 (10Dzahn)
[23:53:15] <wikibugs>	 10Operations, 10Patch-For-Review: Reallocate former image scalers - https://phabricator.wikimedia.org/T192457 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['mw1297.eqiad.wmnet'] ```  and were **ALL** successful.
[23:54:04] <wikibugs>	 10Operations, 10Patch-For-Review: Reallocate former image scalers - https://phabricator.wikimedia.org/T192457 (10Dzahn)
[23:54:18] <icinga-wm>	 RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 34 probes of 319 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts