[00:13:13] <wikibugs>	 (PS1) Bstorm: openstreetmap: add debian stretch to puppet role [puppet] - https://gerrit.wikimedia.org/r/444771 (https://phabricator.wikimedia.org/T197246)
[00:20:04] <wikibugs>	 (CR) Bstorm: [C: 2] openstreetmap: add debian stretch to puppet role [puppet] - https://gerrit.wikimedia.org/r/444771 (https://phabricator.wikimedia.org/T197246) (owner: Bstorm)
[00:21:01] <twentyafterfour>	 !log deploying https://phabricator.wikimedia.org/rPHABc6f75c918afa1cc59472c5fe226539e093f6c3ef
[00:21:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:38:04] <wikibugs>	 (PS1) Reedy: Remove wgTidyConfig; same as DefaultSettings.php [mediawiki-config] - https://gerrit.wikimedia.org/r/444775
[00:42:44] <wikibugs>	 (PS1) Reedy: Timeless is enabled everywhere, remove config [mediawiki-config] - https://gerrit.wikimedia.org/r/444776
[00:56:12] <MatmaRex>	 hi. i've been banned from phabricator? :(
[00:56:16] <MatmaRex>	 TOO MANY REQUESTS
[00:56:16] <MatmaRex>	 You ("185.157.12.102") are issuing too many requests too quickly.
[00:56:33] <MatmaRex>	 that is all i get when trying to view any page. i was just browsing it like every day.
[00:56:59] <icinga-wm>	 PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1
[01:00:35] <MatmaRex>	 well, i can view it again. i don't know if you did anything or if it fixed itself.
[01:00:53] <MatmaRex>	 i'd still be curious to know how i managed to hit a rate limit
[01:01:20] <ebernhardson>	 :S
[01:02:42] <wikibugs>	 (CR) Smalyshev: [C: 1] "Seems to work ok for tests" [puppet] - https://gerrit.wikimedia.org/r/444265 (owner: Smalyshev)
[01:11:22] <wikibugs>	 (CR) Urbanecm: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/444749 (https://phabricator.wikimedia.org/T199161) (owner: Urbanecm)
[01:12:00] <icinga-wm>	 PROBLEM - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is CRITICAL: /api/rest_v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received
[01:15:19] <icinga-wm>	 RECOVERY - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is OK: All endpoints are healthy
[01:21:49] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200)
[01:22:59] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy
[01:31:19] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (bad seed) timed out before a response was received
[01:33:30] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy
[01:34:12] <wikibugs>	 (PS10) Zoranzoki21: Create Publisher namespace in Bengali Wikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/444664 (https://phabricator.wikimedia.org/T199028)
[01:35:19] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) timed out before a response was received
[01:35:51] <wikibugs>	 (CR) Zoranzoki21: [C: 1] "Looks good" [mediawiki-config] - https://gerrit.wikimedia.org/r/444749 (https://phabricator.wikimedia.org/T199161) (owner: Urbanecm)
[01:37:59] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200)
[01:39:49] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy
[01:42:30] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy
[01:45:19] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200)
[01:48:40] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy
[01:49:20] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) timed out before a response was received
[01:53:49] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy
[01:55:20] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200)
[01:57:09] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200)
[01:57:39] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy
[01:58:10] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy
[02:00:59] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200)
[02:01:29] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200)
[02:02:09] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy
[02:04:50] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy
[02:05:20] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200)
[02:08:09] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200)
[02:09:50] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy
[02:11:30] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy
[02:13:19] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) timed out before a response was received
[02:17:10] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200)
[02:18:50] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy
[02:22:10] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200)
[02:25:29] <icinga-wm>	 PROBLEM - Misc HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=misc&var-status_type=5
[02:25:30] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy
[02:26:19] <icinga-wm>	 PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5
[02:27:19] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy
[02:28:50] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200)
[02:30:30] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200)
[02:34:19] <icinga-wm>	 RECOVERY - Misc HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=misc&var-status_type=5
[02:35:00] <icinga-wm>	 RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5
[02:38:59] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy
[02:43:29] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200)
[02:44:29] <logmsgbot>	 !log l10nupdate@deploy1001 ResourceLoader cache refresh completed at Tue Jul 10 02:44:29 UTC 2018 (duration 10m 18s)
[02:44:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:46:10] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy
[02:50:39] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200)
[02:57:19] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy
[03:00:39] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200)
[03:02:40] <icinga-wm>	 PROBLEM - eventstreams on scb2002 is CRITICAL: connect to address 10.192.48.43 and port 8092: Connection refused
[03:03:29] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy
[03:04:00] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy
[03:06:40] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200)
[03:10:49] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200)
[03:11:19] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy
[03:11:50] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy
[03:21:09] <icinga-wm>	 PROBLEM - puppet last run on scb2006 is CRITICAL: CRITICAL: Puppet has 19 failures. Last run 5 minutes ago with 19 failures. Failed resources (up to 3 shown): Exec[chown /srv/deployment/cpjobqueue for deploy-service],Package[recommendation-api/deploy],Exec[chown /srv/deployment/recommendation-api for deploy-service],Package[mobileapps/deploy]
[03:24:50] <icinga-wm>	 RECOVERY - eventstreams on scb2002 is OK: HTTP OK: HTTP/1.1 200 OK - 1066 bytes in 0.093 second response time
[03:29:57] <Josve05a>	 TOO MANY REQUESTS You ("217.209.178.82") are issuing too many requests too quickly.  ---- lol
[03:31:40] <icinga-wm>	 RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1001 is OK: OK: Less than 20.00% above the threshold [500.0] https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1
[03:43:44] <MatmaRex>	 apparently there's a task about the phab rate limit i was complaining about: https://phabricator.wikimedia.org/T198974
[03:43:51] <MatmaRex>	 Josve05a: ^
[03:44:44] <Josve05a>	 i was creating a task/ticket with exception crash code :/ All lost
[03:46:39] <icinga-wm>	 RECOVERY - puppet last run on scb2006 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures
[04:13:41] <davidwbarratt>	 uhh, somehow I've hit the rate limit for Phabricator (just normal browsing)
[04:13:47] <davidwbarratt>	 any ideas?
[04:14:11] <davidwbarratt>	 When I visit any page I get the error:
[04:14:17] <davidwbarratt>	 TOO MANY REQUESTS
[04:14:34] <davidwbarratt>	 You (IP ADDRESS REDACTED) are issuing too many requests too quickly.
[04:14:53] <davidwbarratt>	 hmm now it works (so far)
[04:15:23] <AntiComposite>	 T198974
[04:15:24] <stashbot>	 T198974: Rate-limit is too harsh - https://phabricator.wikimedia.org/T198974
[04:15:57] <AntiComposite>	 You are not the first person to run into this tonight, the time period on the rate limit is fairly short though
[04:16:59] <icinga-wm>	 PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 71, down: 1, dormant: 0, excluded: 0, unused: 0
[04:19:20] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 224, down: 1, dormant: 0, excluded: 0, unused: 0
[04:22:06] <librenms-wmf>	 Warning Alert for device cr2-esams.wikimedia.org - Inbound interface errors
[04:23:10] <davidwbarratt>	 ah interesting
[04:28:57] <marostegui>	 !log Deploy schema change on s1 primary master (db1052) T146591 T197891 T196379
[04:29:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:29:02] <stashbot>	 T196379: Schema change: Add unique index on archive.ar_rev_id - https://phabricator.wikimedia.org/T196379
[04:29:03] <stashbot>	 T197891: Schema change to drop default from externallinks.el_index_60 - https://phabricator.wikimedia.org/T197891
[04:29:03] <stashbot>	 T146591: Add a primary key to l10n_cache - https://phabricator.wikimedia.org/T146591
[04:36:43] <marostegui>	 !log Optimize bgwiki itwiki svwiki zhwiki wbc_entity_usage on db1066 (s2 primary master) - T187521
[04:36:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:36:47] <stashbot>	 T187521: Optimize recentchanges and wbc_entity_usage table across wikis - https://phabricator.wikimedia.org/T187521
[04:40:51] <wikibugs>	 (PS1) Marostegui: db-eqiad.php: Depool db1084 [mediawiki-config] - https://gerrit.wikimedia.org/r/444781 (https://phabricator.wikimedia.org/T146591)
[04:45:01] <wikibugs>	 Operations, ops-eqiad, DBA: db1069 bad disk - https://phabricator.wikimedia.org/T199056 (Marostegui) Resolved>Open Actually this disk has SMART errors too. Was this a re-used or a new disk, @Cmjohnson?  ``` PD: 0 Information Enclosure Device ID: 32 Slot Number: 0 Drive's position: DiskGrou...
[04:45:21] <icinga-wm>	 ACKNOWLEDGEMENT - Device not healthy -SMART- on db1069 is CRITICAL: cluster=mysql device=megaraid,0 instance=db1069:9100 job=node site=eqiad Marostegui T199056 - The acknowledgement expires at: 2018-07-13 04:45:07. https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db1069&var-datasource=eqiad%2520prometheus%252Fops
[04:45:37] <wikibugs>	 (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1084 [mediawiki-config] - https://gerrit.wikimedia.org/r/444781 (https://phabricator.wikimedia.org/T146591) (owner: Marostegui)
[04:47:20] <wikibugs>	 (Merged) jenkins-bot: db-eqiad.php: Depool db1084 [mediawiki-config] - https://gerrit.wikimedia.org/r/444781 (https://phabricator.wikimedia.org/T146591) (owner: Marostegui)
[04:48:43] <logmsgbot>	 !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1084 for alter table (duration: 00m 52s)
[04:48:45] <marostegui>	 !log Deploy schema change on db1084 T146591 T197891 T196379
[04:48:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:48:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:48:50] <stashbot>	 T196379: Schema change: Add unique index on archive.ar_rev_id - https://phabricator.wikimedia.org/T196379
[04:48:51] <stashbot>	 T197891: Schema change to drop default from externallinks.el_index_60 - https://phabricator.wikimedia.org/T197891
[04:48:51] <stashbot>	 T146591: Add a primary key to l10n_cache - https://phabricator.wikimedia.org/T146591
[04:50:01] <wikibugs>	 (PS1) Marostegui: Revert "db-eqiad.php: Depool db1084" [mediawiki-config] - https://gerrit.wikimedia.org/r/444783
[04:53:58] <wikibugs>	 (CR) Marostegui: [C: 2] Revert "db-eqiad.php: Depool db1084" [mediawiki-config] - https://gerrit.wikimedia.org/r/444783 (owner: Marostegui)
[04:55:12] <wikibugs>	 (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1084" [mediawiki-config] - https://gerrit.wikimedia.org/r/444783 (owner: Marostegui)
[04:57:28] <logmsgbot>	 !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1084 after alter table (duration: 00m 50s)
[04:57:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:58:41] <wikibugs>	 (PS1) Marostegui: db-eqiad.php: Depool db1097:3314 [mediawiki-config] - https://gerrit.wikimedia.org/r/444784 (https://phabricator.wikimedia.org/T146591)
[04:59:28] <marostegui>	 !log Optimize frwiki.wbc_entity_usage on s6 codfw, this will generate lag on s6 codfw - T187521
[04:59:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:59:32] <stashbot>	 T187521: Optimize recentchanges and wbc_entity_usage table across wikis - https://phabricator.wikimedia.org/T187521
[05:01:01] <wikibugs>	 (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1097:3314 [mediawiki-config] - https://gerrit.wikimedia.org/r/444784 (https://phabricator.wikimedia.org/T146591) (owner: Marostegui)
[05:01:29] <icinga-wm>	 RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 73, down: 0, dormant: 0, excluded: 0, unused: 0
[05:01:39] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 226, down: 0, dormant: 0, excluded: 0, unused: 0
[05:02:46] <wikibugs>	 (Merged) jenkins-bot: db-eqiad.php: Depool db1097:3314 [mediawiki-config] - https://gerrit.wikimedia.org/r/444784 (https://phabricator.wikimedia.org/T146591) (owner: Marostegui)
[05:03:56] <logmsgbot>	 !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1097:3314 for alter table (duration: 00m 51s)
[05:03:58] <marostegui>	 !log Deploy schema change on db1097:3314 T146591 T197891 T196379
[05:03:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:04:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:04:04] <stashbot>	 T196379: Schema change: Add unique index on archive.ar_rev_id - https://phabricator.wikimedia.org/T196379
[05:04:04] <stashbot>	 T197891: Schema change to drop default from externallinks.el_index_60 - https://phabricator.wikimedia.org/T197891
[05:04:05] <stashbot>	 T146591: Add a primary key to l10n_cache - https://phabricator.wikimedia.org/T146591
[05:04:17] <wikibugs>	 (PS1) Marostegui: Revert "db-eqiad.php: Depool db1097:3314" [mediawiki-config] - https://gerrit.wikimedia.org/r/444785
[05:07:05] <wikibugs>	 (CR) Marostegui: [C: 2] Revert "db-eqiad.php: Depool db1097:3314" [mediawiki-config] - https://gerrit.wikimedia.org/r/444785 (owner: Marostegui)
[05:08:52] <wikibugs>	 (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1097:3314" [mediawiki-config] - https://gerrit.wikimedia.org/r/444785 (owner: Marostegui)
[05:10:49] <logmsgbot>	 !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1097:3314 after alter table (duration: 00m 50s)
[05:10:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:11:16] <wikibugs>	 (PS1) Marostegui: db-eqiad.php: Depool db1103:3314 [mediawiki-config] - https://gerrit.wikimedia.org/r/444787 (https://phabricator.wikimedia.org/T146591)
[05:13:52] <wikibugs>	 (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1103:3314 [mediawiki-config] - https://gerrit.wikimedia.org/r/444787 (https://phabricator.wikimedia.org/T146591) (owner: Marostegui)
[05:15:05] <wikibugs>	 (Merged) jenkins-bot: db-eqiad.php: Depool db1103:3314 [mediawiki-config] - https://gerrit.wikimedia.org/r/444787 (https://phabricator.wikimedia.org/T146591) (owner: Marostegui)
[05:15:24] <wikibugs>	 (PS1) Marostegui: Revert "db-eqiad.php: Depool db1103:3314" [mediawiki-config] - https://gerrit.wikimedia.org/r/444789
[05:16:22] <logmsgbot>	 !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1103:3314 for alter table (duration: 00m 50s)
[05:16:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:16:41] <marostegui>	 !log Deploy schema change on db1103:3314 T146591 T197891 T196379
[05:16:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:16:48] <stashbot>	 T196379: Schema change: Add unique index on archive.ar_rev_id - https://phabricator.wikimedia.org/T196379
[05:16:48] <stashbot>	 T197891: Schema change to drop default from externallinks.el_index_60 - https://phabricator.wikimedia.org/T197891
[05:16:49] <stashbot>	 T146591: Add a primary key to l10n_cache - https://phabricator.wikimedia.org/T146591
[05:17:21] <marostegui>	 !log Optimize frwiki.wbc_entity_usage on s6 eqiad hosts T187521
[05:17:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:17:24] <stashbot>	 T187521: Optimize recentchanges and wbc_entity_usage table across wikis - https://phabricator.wikimedia.org/T187521
[05:17:46] <wikibugs>	 (CR) Krinkle: "Might be better to use unprefixed variables for these, or wmg* prefix. That way they won't clash with wg* and/or be accessible through Con" [mediawiki-config] - https://gerrit.wikimedia.org/r/444632 (owner: Thiemo Kreuz (WMDE))
[05:17:55] <wikibugs>	 (CR) Krinkle: [C: 1] Do not leak local $wgWBShared… variables to the global scope [mediawiki-config] - https://gerrit.wikimedia.org/r/444632 (owner: Thiemo Kreuz (WMDE))
[05:18:50] <wikibugs>	 (CR) Marostegui: [C: 2] Revert "db-eqiad.php: Depool db1103:3314" [mediawiki-config] - https://gerrit.wikimedia.org/r/444789 (owner: Marostegui)
[05:20:03] <wikibugs>	 (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1103:3314" [mediawiki-config] - https://gerrit.wikimedia.org/r/444789 (owner: Marostegui)
[05:21:08] <marostegui>	 !log Deploy schema change on db1121 with replication, this will generate lag on s4 labs hosts T146591 T197891 T196379
[05:21:10] <logmsgbot>	 !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1103:3314 after alter table (duration: 00m 50s)
[05:21:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:21:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:22:06] <librenms-wmf>	 Device cr2-esams.wikimedia.org recovered from Inbound interface errors
[05:23:03] <wikibugs>	 (PS1) Marostegui: db-eqiad.php: Depool db1081 [mediawiki-config] - https://gerrit.wikimedia.org/r/444790 (https://phabricator.wikimedia.org/T146591)
[05:25:02] <wikibugs>	 (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1081 [mediawiki-config] - https://gerrit.wikimedia.org/r/444790 (https://phabricator.wikimedia.org/T146591) (owner: Marostegui)
[05:26:12] <wikibugs>	 (Merged) jenkins-bot: db-eqiad.php: Depool db1081 [mediawiki-config] - https://gerrit.wikimedia.org/r/444790 (https://phabricator.wikimedia.org/T146591) (owner: Marostegui)
[05:27:23] <logmsgbot>	 !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1081 for alter table (duration: 00m 50s)
[05:27:24] <marostegui>	 !log Deploy schema change on db1081 T146591 T197891 T196379
[05:27:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:27:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:27:30] <stashbot>	 T196379: Schema change: Add unique index on archive.ar_rev_id - https://phabricator.wikimedia.org/T196379
[05:27:30] <stashbot>	 T197891: Schema change to drop default from externallinks.el_index_60 - https://phabricator.wikimedia.org/T197891
[05:27:30] <stashbot>	 T146591: Add a primary key to l10n_cache - https://phabricator.wikimedia.org/T146591
[05:28:03] <wikibugs>	 (PS1) Marostegui: Revert "db-eqiad.php: Depool db1081" [mediawiki-config] - https://gerrit.wikimedia.org/r/444791
[05:29:28] <wikibugs>	 (CR) Marostegui: [C: 2] Revert "db-eqiad.php: Depool db1081" [mediawiki-config] - https://gerrit.wikimedia.org/r/444791 (owner: Marostegui)
[05:31:04] <wikibugs>	 (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1081" [mediawiki-config] - https://gerrit.wikimedia.org/r/444791 (owner: Marostegui)
[05:32:10] <wikibugs>	 (PS1) Marostegui: db-eqiad.php: Depool db1091 [mediawiki-config] - https://gerrit.wikimedia.org/r/444792 (https://phabricator.wikimedia.org/T146591)
[05:32:10] <logmsgbot>	 !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1081 after alter table (duration: 00m 49s)
[05:32:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:33:57] <wikibugs>	 (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1091 [mediawiki-config] - https://gerrit.wikimedia.org/r/444792 (https://phabricator.wikimedia.org/T146591) (owner: Marostegui)
[05:35:41] <wikibugs>	 (Merged) jenkins-bot: db-eqiad.php: Depool db1091 [mediawiki-config] - https://gerrit.wikimedia.org/r/444792 (https://phabricator.wikimedia.org/T146591) (owner: Marostegui)
[05:35:59] <wikibugs>	 (PS1) Marostegui: Revert "db-eqiad.php: Depool db1091" [mediawiki-config] - https://gerrit.wikimedia.org/r/444793
[05:36:41] <logmsgbot>	 !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1091 for alter table (duration: 00m 50s)
[05:36:42] <marostegui>	 !log Deploy schema change on db1091 T146591 T197891 T196379
[05:36:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:36:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:36:48] <stashbot>	 T196379: Schema change: Add unique index on archive.ar_rev_id - https://phabricator.wikimedia.org/T196379
[05:36:48] <stashbot>	 T197891: Schema change to drop default from externallinks.el_index_60 - https://phabricator.wikimedia.org/T197891
[05:36:48] <stashbot>	 T146591: Add a primary key to l10n_cache - https://phabricator.wikimedia.org/T146591
[05:37:50] <wikibugs>	 (CR) Marostegui: [C: 2] Revert "db-eqiad.php: Depool db1091" [mediawiki-config] - https://gerrit.wikimedia.org/r/444793 (owner: Marostegui)
[05:39:33] <wikibugs>	 (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1091" [mediawiki-config] - https://gerrit.wikimedia.org/r/444793 (owner: Marostegui)
[05:39:59] <marostegui>	 !log Deploy schema change on s4 primary master (db1068) T146591 T197891 T196379
[05:40:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:40:44] <logmsgbot>	 !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1091 after alter table (duration: 00m 50s)
[05:40:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:00:29] <icinga-wm>	 RECOVERY - Check systemd state on notebook1004 is OK: OK - running: The system is fully operational
[06:02:16] <marostegui>	 !log Deploy schema change on codfw s8 master (db2045) with replication, this will generate lag on s8 codfw T146591 T197891 T196379
[06:02:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:02:22] <stashbot>	 T196379: Schema change: Add unique index on archive.ar_rev_id - https://phabricator.wikimedia.org/T196379
[06:02:22] <stashbot>	 T197891: Schema change to drop default from externallinks.el_index_60 - https://phabricator.wikimedia.org/T197891
[06:02:22] <stashbot>	 T146591: Add a primary key to l10n_cache - https://phabricator.wikimedia.org/T146591
[06:04:29] <icinga-wm>	 PROBLEM - puppet last run on dbstore1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[06:06:35] <marostegui>	 !log Deploy schema change on dbstore1002:s8 T146591 T197891 T196379
[06:06:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:07:36] <wikibugs>	 (PS1) Elukey: Add interface::add_ip6_mapped to stat* hosts [puppet] - https://gerrit.wikimedia.org/r/444795 (https://phabricator.wikimedia.org/T199180)
[06:07:44] <elukey>	 poor dbstore1002, nobody leaves it alone :D
[06:08:33] <marostegui>	 it is good for it!
[06:12:01] <elukey>	 hahahaha
[06:12:13] <marostegui>	 new indexes, PKs...
[06:12:18] <marostegui>	 all good for our dbstore1002
[06:16:40] <wikibugs>	 (CR) Elukey: [C: 2] Add interface::add_ip6_mapped to stat* hosts [puppet] - https://gerrit.wikimedia.org/r/444795 (https://phabricator.wikimedia.org/T199180) (owner: Elukey)
[06:17:43] <marostegui>	 !log Deploy schema change on s8 primary master (db1071) T146591 T197891 T196379
[06:17:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:17:48] <stashbot>	 T196379: Schema change: Add unique index on archive.ar_rev_id - https://phabricator.wikimedia.org/T196379
[06:17:48] <stashbot>	 T197891: Schema change to drop default from externallinks.el_index_60 - https://phabricator.wikimedia.org/T197891
[06:17:49] <stashbot>	 T146591: Add a primary key to l10n_cache - https://phabricator.wikimedia.org/T146591
[06:21:00] <icinga-wm>	 PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1
[06:23:50] <marostegui>	 !log Deploy schema change on codfw s7 master (db2040) with replication, this will generate lag on s7 codfw T146591 T197891 T196379
[06:23:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:23:55] <stashbot>	 T196379: Schema change: Add unique index on archive.ar_rev_id - https://phabricator.wikimedia.org/T196379
[06:23:56] <stashbot>	 T197891: Schema change to drop default from externallinks.el_index_60 - https://phabricator.wikimedia.org/T197891
[06:23:56] <stashbot>	 T146591: Add a primary key to l10n_cache - https://phabricator.wikimedia.org/T146591
[06:29:30] <icinga-wm>	 PROBLEM - puppet last run on labvirt1017 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 3 minutes ago with 2 failures. Failed resources (up to 3 shown): File[/usr/lib/nagios/plugins/check_sysctl],File[/usr/local/lib/nagios/plugins/check_long_procs]
[06:32:37] <marostegui>	 !log Optimize wbc_entity_usage on arwiki cawiki huwiki rowiki ukwiki on s7 codfw master (db2040) with replication, this will generate lag on s7 codfw - T187521
[06:32:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:32:40] <stashbot>	 T187521: Optimize recentchanges and wbc_entity_usage table across wikis - https://phabricator.wikimedia.org/T187521
[06:33:50] <icinga-wm>	 RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1001 is OK: OK: Less than 20.00% above the threshold [500.0] https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1
[06:37:35] <wikibugs>	 (03PS1) 10Elukey: Add interface::add_ip6_mapped to analytics100[2,3] [puppet] - 10https://gerrit.wikimedia.org/r/444796 (https://phabricator.wikimedia.org/T199180)
[06:39:03] <wikibugs>	 (03CR) 10Elukey: [C: 032] Add interface::add_ip6_mapped to analytics100[2,3] [puppet] - 10https://gerrit.wikimedia.org/r/444796 (https://phabricator.wikimedia.org/T199180) (owner: 10Elukey)
[06:39:31] <icinga-wm>	 RECOVERY - puppet last run on labvirt1017 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[06:49:18] <wikibugs>	 (03PS1) 10Jcrespo: mariadb package: Update stretch package to the latest version [software] - 10https://gerrit.wikimedia.org/r/444797
[06:49:48] <wikibugs>	 (03CR) 10Jcrespo: [V: 032 C: 032] mariadb package: Update stretch package to the latest version [software] - 10https://gerrit.wikimedia.org/r/444797 (owner: 10Jcrespo)
[06:53:19] <twentyafterfour>	 !log deployed rPHEX03173dd0097451f60faa3a1705abee58a9fe4c5f
[06:53:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:53:44] <wikibugs>	 (03PS1) 10Jcrespo: Remove .hosts files, update tendril instead [software] - 10https://gerrit.wikimedia.org/r/444798
[06:54:35] <wikibugs>	 (03PS1) 10Elukey: Add interface::add_ip6_mapped to analytics1001 [puppet] - 10https://gerrit.wikimedia.org/r/444799 (https://phabricator.wikimedia.org/T199180)
[06:56:24] <wikibugs>	 (03CR) 10Elukey: [C: 032] Add interface::add_ip6_mapped to analytics1001 [puppet] - 10https://gerrit.wikimedia.org/r/444799 (https://phabricator.wikimedia.org/T199180) (owner: 10Elukey)
[06:58:03] <wikibugs>	 (03CR) 10Marostegui: [C: 031] Remove .hosts files, update tendril instead [software] - 10https://gerrit.wikimedia.org/r/444798 (owner: 10Jcrespo)
[06:58:16] <wikibugs>	 (03PS3) 10Giuseppe Lavagetto: mediawiki: start splitting up remnant.conf [puppet] - 10https://gerrit.wikimedia.org/r/444184
[06:58:18] <wikibugs>	 (03PS3) 10Giuseppe Lavagetto: mediawiki: unify the small private wikis definitions [puppet] - 10https://gerrit.wikimedia.org/r/444185
[06:58:20] <wikibugs>	 (03PS3) 10Giuseppe Lavagetto: mediawiki: move private wikis to a separate virtual host [puppet] - 10https://gerrit.wikimedia.org/r/444186
[06:58:22] <wikibugs>	 (03PS3) 10Giuseppe Lavagetto: mediawiki: split all of remnant.conf into individual vhosts [puppet] - 10https://gerrit.wikimedia.org/r/444187
[06:58:24] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: mediawiki_test: split wikimania.conf [puppet] - 10https://gerrit.wikimedia.org/r/444240
[06:58:27] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: mediawiki_test: complete the transition to one wiki per template. [puppet] - 10https://gerrit.wikimedia.org/r/444241
[07:00:01] <wikibugs>	 (03CR) 10Jcrespo: "I am setting up the backups for the zarcillo database tables before I can deploy it." [software] - 10https://gerrit.wikimedia.org/r/444798 (owner: 10Jcrespo)
[07:00:42] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] mediawiki_test: complete the transition to one wiki per template. [puppet] - 10https://gerrit.wikimedia.org/r/444241 (owner: 10Giuseppe Lavagetto)
[07:04:30] <wikibugs>	 (03PS1) 10Jcrespo: mariadb: Setup backups for zarcillo database on tendril [puppet] - 10https://gerrit.wikimedia.org/r/444800 (https://phabricator.wikimedia.org/T198987)
[07:07:47] <wikibugs>	 (03CR) 10Jcrespo: "Let's deploy this and let's take at least one correct backup before deploying gerrit:444798." [puppet] - 10https://gerrit.wikimedia.org/r/444800 (https://phabricator.wikimedia.org/T198987) (owner: 10Jcrespo)
[07:07:54] <wikibugs>	 (03PS1) 10Muehlenhoff: Update account date for mpany, remove two fund raising contractors [puppet] - 10https://gerrit.wikimedia.org/r/444801
[07:08:40] <icinga-wm>	 PROBLEM - puppet last run on analytics1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[hadoop-hdfs-zkfc-init]
[07:09:31] <elukey>	 this is me, fixing in a sec --^
[07:09:31] <wikibugs>	 (03PS1) 10Elukey: network::constants: update ip6 addresses for hadoop master nodes [puppet] - 10https://gerrit.wikimedia.org/r/444802 (https://phabricator.wikimedia.org/T199180)
[07:10:25] <wikibugs>	 (03CR) 10Elukey: [C: 032] network::constants: update ip6 addresses for hadoop master nodes [puppet] - 10https://gerrit.wikimedia.org/r/444802 (https://phabricator.wikimedia.org/T199180) (owner: 10Elukey)
[07:12:23] <wikibugs>	 (03CR) 10Marostegui: [C: 031] mariadb: Setup backups for zarcillo database on tendril [puppet] - 10https://gerrit.wikimedia.org/r/444800 (https://phabricator.wikimedia.org/T198987) (owner: 10Jcrespo)
[07:13:34] <wikibugs>	 (03PS4) 10Giuseppe Lavagetto: mediawiki: start splitting up remnant.conf [puppet] - 10https://gerrit.wikimedia.org/r/444184
[07:13:53] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] mediawiki: start splitting up remnant.conf [puppet] - 10https://gerrit.wikimedia.org/r/444184 (owner: 10Giuseppe Lavagetto)
[07:16:02] <wikibugs>	 (03PS2) 10Jcrespo: mariadb: Setup backups for zarcillo database on tendril [puppet] - 10https://gerrit.wikimedia.org/r/444800 (https://phabricator.wikimedia.org/T198987)
[07:16:28] <wikibugs>	 (03CR) 10Jcrespo: [C: 032] mariadb: Setup backups for zarcillo database on tendril [puppet] - 10https://gerrit.wikimedia.org/r/444800 (https://phabricator.wikimedia.org/T198987) (owner: 10Jcrespo)
[07:18:34] <librenms-wmf>	 04Critical Alert for device asw2-a-eqiad.mgmt.eqiad.wmnet - Critical syslog messages
[07:18:50] <icinga-wm>	 RECOVERY - puppet last run on analytics1001 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures
[07:19:38] <wikibugs>	 (03PS1) 10Muehlenhoff: Add Debian conditional for prometheus-mysqld-exporter service_auto_restart [puppet] - 10https://gerrit.wikimedia.org/r/444804
[07:22:40] <elukey>	 so show log messages on asw2-a returns stuff like
[07:22:48] <elukey>	 Jul 10 07:13:37  asw2-a-eqiad fpc8 Rear QSFP+ PIC Chan# 3: Tx laser fault cleared
[07:22:51] <elukey>	 Jul 10 07:13:37  asw2-a-eqiad fpc8 Rear QSFP+ PIC Chan# 3: Rx loss cleared
[07:24:00] <elukey>	 akosiaris, paravoid --^
[07:24:13] <elukey>	 it is probably nothing but better to triple check :)
[07:25:21] <vgutierrez>	 elukey: you logged in after seeing the critical message here, right?
[07:26:25] <elukey>	 yep yep, I am aware of the issue that happened to Chase
[07:26:29] <vgutierrez>	 ack :)
[07:26:55] <elukey>	 https://librenms.wikimedia.org/device/device=160/tab=health/metric=storage/ would probably need a check too (just seen it passing by)
[07:27:27] <vgutierrez>	 it's kinda weird.. the alert event triggered at 07:18 according to https://librenms.wikimedia.org/device/device=160/tab=logs/section=eventlog/
[07:27:43] <vgutierrez>	 2018-07-10 07:18:01    System    Issued critical alert for rule 'Critical syslog messages' to transport 'irc'
[07:28:10] <elukey>	 it might collect stuff from prev mins and then if something looks weird it alarms?
[07:28:18] <vgutierrez>	 but the last critical is from 07:13
[07:28:34] <librenms-wmf>	 04̶C̶r̶i̶t̶i̶c̶a̶l Device asw2-a-eqiad.mgmt.eqiad.wmnet recovered from Critical syslog messages
[07:28:39] <vgutierrez>	 elukey: probably
[07:28:57] <elukey>	 https://librenms.wikimedia.org/alert-rules/
[07:29:05] <elukey>	 %syslog.timestamp >= %macros.past_5m && %syslog.priority = "crit" && %syslog.msg !~ "preauth" && %syslog.msg !~ "ipc_version_icu_bypass"
[07:29:18] <elukey>	 vgutierrez: --^
[07:29:43] <elukey>	 all right seems nothing happened, will wait for the experts to confirm :)
[07:29:47] <elukey>	 Cc: XioNoX 
[07:29:58] <elukey>	 (brb)
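Note on the exchange above: the gap between the 07:13 syslog entries and the 07:18 alert is consistent with the five-minute window in the pasted rule. The following is a minimal sketch of that rule, not LibreNMS code. It assumes the fpc8 messages were logged at priority "crit", that both timestamps are in the same timezone, and it approximates the `!~` operator with a plain substring check. The function name `rule_matches` is hypothetical.

```python
from datetime import datetime, timedelta


def rule_matches(msg_time, priority, msg, now, window=timedelta(minutes=5)):
    """Hypothetical re-statement of the pasted alert rule, for illustration only."""
    return (
        msg_time >= now - window                   # %syslog.timestamp >= %macros.past_5m
        and priority == "crit"                     # %syslog.priority = "crit"
        and "preauth" not in msg                   # %syslog.msg !~ "preauth"
        and "ipc_version_icu_bypass" not in msg    # %syslog.msg !~ "ipc_version_icu_bypass"
    )


# Timestamps taken from the log above; the "crit" priority is an assumption.
logged = datetime(2018, 7, 10, 7, 13, 37)
msg = "fpc8 Rear QSFP+ PIC Chan# 3: Tx laser fault cleared"

alert_issued = datetime(2018, 7, 10, 7, 18, 1)
recovered = datetime(2018, 7, 10, 7, 28, 34)

print(rule_matches(logged, "crit", msg, alert_issued))  # True: 07:13:37 is inside 07:13:01-07:18:01
print(rule_matches(logged, "crit", msg, recovered))     # False: the window has moved past 07:13:37
```

Under these assumptions a message from 07:13:37 still matches when the rule is evaluated at 07:18:01 and no longer matches by 07:28:34, which fits the alert and recovery times reported by librenms-wmf.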
[07:33:13] <wikibugs>	 (03CR) 10Volans: Remove .hosts files, update tendril instead (031 comment) [software] - 10https://gerrit.wikimedia.org/r/444798 (owner: 10Jcrespo)
[07:33:58] <twentyafterfour>	 !log deploying fix for phabricator userpage having hard-coded my username 
[07:34:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:34:45] <wikibugs>	 (03PS2) 10Muehlenhoff: Add Debian conditional for prometheus-mysqld-exporter service_auto_restart [puppet] - 10https://gerrit.wikimedia.org/r/444804
[07:36:03] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 032] Add Debian conditional for prometheus-mysqld-exporter service_auto_restart [puppet] - 10https://gerrit.wikimedia.org/r/444804 (owner: 10Muehlenhoff)
[07:36:31] <wikibugs>	 10Operations, 10Phabricator: Getting 'TOO MANY REQUESTS' error - https://phabricator.wikimedia.org/T199184 (10MarcoAurelio) I guess this is implemented via some puppet config. Please revert if I am mistaken.
[07:39:43] <wikibugs>	 (03PS2) 10Muehlenhoff: Update account date for mpany, remove two fund raising contractors [puppet] - 10https://gerrit.wikimedia.org/r/444801
[07:41:09] <icinga-wm>	 RECOVERY - puppet last run on dbstore1002 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[07:42:51] <wikibugs>	 (03PS2) 10Vgutierrez: [WIP] get rid of openssl CLI usage [software/certcentral] - 10https://gerrit.wikimedia.org/r/444631
[07:43:23] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 032] Update account date for mpany, remove two fund raising contractors [puppet] - 10https://gerrit.wikimedia.org/r/444801 (owner: 10Muehlenhoff)
[07:43:50] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] [WIP] get rid of openssl CLI usage [software/certcentral] - 10https://gerrit.wikimedia.org/r/444631 (owner: 10Vgutierrez)
[07:44:20] <wikibugs>	 (03CR) 10Vgutierrez: [WIP] get rid of openssl CLI usage (034 comments) [software/certcentral] - 10https://gerrit.wikimedia.org/r/444631 (owner: 10Vgutierrez)
[07:46:39] <wikibugs>	 (03PS1) 10Elukey: Add interface::add_ip6_mapped to all the Hadoop worker nodes [puppet] - 10https://gerrit.wikimedia.org/r/444805 (https://phabricator.wikimedia.org/T199180)
[07:47:21] <wikibugs>	 (03PS2) 10Elukey: Add interface::add_ip6_mapped to all the Hadoop worker nodes [puppet] - 10https://gerrit.wikimedia.org/r/444805 (https://phabricator.wikimedia.org/T199180)
[07:49:54] <wikibugs>	 (03CR) 10Elukey: [C: 032] Add interface::add_ip6_mapped to all the Hadoop worker nodes [puppet] - 10https://gerrit.wikimedia.org/r/444805 (https://phabricator.wikimedia.org/T199180) (owner: 10Elukey)
[07:50:26] <elukey>	 moritzm: can I merge yours too?
[07:51:05] <wikibugs>	 (03CR) 10Jcrespo: Remove .hosts files, update tendril instead (031 comment) [software] - 10https://gerrit.wikimedia.org/r/444798 (owner: 10Jcrespo)
[07:54:54] <moritzm>	 ah, yes. please go ahead
[07:58:09] <moritzm>	 !log installing ntp security updates on trusty (may trigger some Icinga warnings about clocks, these recover after a while)
[07:58:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:58:13] <elukey>	 ack!
[08:03:03] <wikibugs>	 10Operations, 10Phabricator: Getting 'TOO MANY REQUESTS' error - https://phabricator.wikimedia.org/T199184 (10Jc86035)
[08:03:30] <wikibugs>	 10Operations, 10Phabricator: Rate-limit is too harsh - https://phabricator.wikimedia.org/T198974 (10Jc86035)
[08:04:49] <wikibugs>	 (03PS1) 10Joal: role::common::aqs: update druid mediawiki's datasource [puppet] - 10https://gerrit.wikimedia.org/r/444807
[08:04:58] <joal>	 elukey: --^
[08:05:02] <joal>	 if you have a minute
[08:05:10] <wikibugs>	 10Operations, 10Phabricator: Rate-limit is too harsh - https://phabricator.wikimedia.org/T198974 (10Jc86035)
[08:07:03] <wikibugs>	 10Operations, 10Phabricator: Rate-limit is too harsh and affects human users - https://phabricator.wikimedia.org/T198974 (10Jc86035)
[08:08:00] <wikibugs>	 (03CR) 10Elukey: [C: 032] role::common::aqs: update druid mediawiki's datasource [puppet] - 10https://gerrit.wikimedia.org/r/444807 (owner: 10Joal)
[08:08:26] <wikibugs>	 (03PS1) 10Jcrespo: wmf_root_clients: Productionize mysql.py wrapper [puppet] - 10https://gerrit.wikimedia.org/r/444808
[08:08:38] <moritzm>	 !log installing openslp security updates on trusty
[08:08:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:08:51] <wikibugs>	 (03CR) 10Jcrespo: "Happy now?" [puppet] - 10https://gerrit.wikimedia.org/r/444808 (owner: 10Jcrespo)
[08:09:05] <godog>	 !log disable puppet on hosts running cassandra before merging 444247 and 443114
[08:09:06] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] wmf_root_clients: Productionize mysql.py wrapper [puppet] - 10https://gerrit.wikimedia.org/r/444808 (owner: 10Jcrespo)
[08:09:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:09:25] <wikibugs>	 (03PS6) 10Filippo Giunchedi: restbase: cleanup remaining detritus from storage transition [puppet] - 10https://gerrit.wikimedia.org/r/443114 (https://phabricator.wikimedia.org/T191659) (owner: 10Eevans)
[08:09:49] <wikibugs>	 10Operations, 10Phabricator: Rate-limit is too harsh and affects human users - https://phabricator.wikimedia.org/T198974 (10Jc86035)
[08:09:51] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 032] restbase: cleanup remaining detritus from storage transition [puppet] - 10https://gerrit.wikimedia.org/r/443114 (https://phabricator.wikimedia.org/T191659) (owner: 10Eevans)
[08:09:58] <elukey>	 joal: merged but before rolling out we'd need to wait a sec for Filippo rolling out --^
[08:10:36] <godog>	 elukey: hah! feel free to reenable puppet where you want btw, I disabled as a precaution but it is a noop on e.g. aqs
[08:10:54] <elukey>	 ah okok! Will run it there and check then :)
[08:11:08] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 032] RESTBase: Disable cassandra-metrics-collector [puppet] - 10https://gerrit.wikimedia.org/r/444247 (https://phabricator.wikimedia.org/T186567) (owner: 10Mobrovac)
[08:11:17] <wikibugs>	 (03PS2) 10Filippo Giunchedi: RESTBase: Disable cassandra-metrics-collector [puppet] - 10https://gerrit.wikimedia.org/r/444247 (https://phabricator.wikimedia.org/T186567) (owner: 10Mobrovac)
[08:11:36] <joal>	 Thanks elukey and godog :)
[08:11:53] <wikibugs>	 10Operations, 10Phabricator: Rate-limit is too harsh and affects human users - https://phabricator.wikimedia.org/T198974 (10Jc86035)
[08:12:46] <wikibugs>	 (03PS2) 10Jcrespo: wmf_root_clients: Productionize mysql.py wrapper [puppet] - 10https://gerrit.wikimedia.org/r/444808
[08:14:16] <elukey>	 joal: aqs1004 running with the new config, if ok I'll complete the restarts
[08:15:01] <joal>	 elukey: 1 min for me to check please :)
[08:15:32] <wikibugs>	 (03PS2) 10Jcrespo: Remove .hosts files, update tendril instead [software] - 10https://gerrit.wikimedia.org/r/444798
[08:16:39] <icinga-wm>	 PROBLEM - Check systemd state on restbase2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[08:16:46] <godog>	 that's me ^
[08:16:52] <jynus>	 volans: please review https://gerrit.wikimedia.org/r/444808
[08:17:06] <wikibugs>	 (03CR) 10jenkins-bot: Cleanup: Remove wgWikiEditorFeatures, dropped in master in Ia1eb91d2d [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444655 (owner: 10Jforrester)
[08:17:08] <wikibugs>	 (03CR) 10jenkins-bot: Cleanup: Remove Beta Cluster use of wikieditor-preview preference, no longer around [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444656 (owner: 10Jforrester)
[08:17:12] <wikibugs>	 (03CR) 10jenkins-bot: Cleanup: Stop setting wmgVisualEditorNonAccountEnableProportion to false [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444657 (owner: 10Jforrester)
[08:17:14] <wikibugs>	 (03CR) 10jenkins-bot: Cleanup: Stop setting wgVisualEditorNonAccountEnableProportion, dropped in master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444658 (owner: 10Jforrester)
[08:17:16] <wikibugs>	 (03CR) 10jenkins-bot: Cleanup: Stop setting wgTmhEnableMp3Uploads, dropped ages ago [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444659 (owner: 10Jforrester)
[08:17:18] <wikibugs>	 (03CR) 10jenkins-bot: Cleanup: Stop setting wmgTmhEnableMp3Uploads, default true everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444660 (owner: 10Jforrester)
[08:17:20] <wikibugs>	 (03CR) 10jenkins-bot: Cleanup: No need for officewiki-specific upload for MP3s any more [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444661 (owner: 10Jforrester)
[08:17:22] <wikibugs>	 (03CR) 10jenkins-bot: Cleanup: Stop trying to set wgLicenseURL, never read [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444663 (https://phabricator.wikimedia.org/T154069) (owner: 10Jforrester)
[08:17:24] <wikibugs>	 (03CR) 10jenkins-bot: db-eqiad.php: Depool db1084 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444781 (https://phabricator.wikimedia.org/T146591) (owner: 10Marostegui)
[08:17:26] <wikibugs>	 (03PS1) 1020after4: Phabricator: Double the rate limit and connection limit [puppet] - 10https://gerrit.wikimedia.org/r/444810 (https://phabricator.wikimedia.org/T198974)
[08:17:28] <wikibugs>	 (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1084" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444783 (owner: 10Marostegui)
[08:17:30] <wikibugs>	 (03CR) 10jenkins-bot: db-eqiad.php: Depool db1097:3314 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444784 (https://phabricator.wikimedia.org/T146591) (owner: 10Marostegui)
[08:17:32] <wikibugs>	 (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1097:3314" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444785 (owner: 10Marostegui)
[08:17:34] <wikibugs>	 (03CR) 10jenkins-bot: db-eqiad.php: Depool db1103:3314 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444787 (https://phabricator.wikimedia.org/T146591) (owner: 10Marostegui)
[08:17:36] <wikibugs>	 (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1103:3314" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444789 (owner: 10Marostegui)
[08:17:38] <wikibugs>	 (03CR) 10jenkins-bot: db-eqiad.php: Depool db1081 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444790 (https://phabricator.wikimedia.org/T146591) (owner: 10Marostegui)
[08:17:40] <wikibugs>	 (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1081" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444791 (owner: 10Marostegui)
[08:17:42] <wikibugs>	 (03CR) 10jenkins-bot: db-eqiad.php: Depool db1091 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444792 (https://phabricator.wikimedia.org/T146591) (owner: 10Marostegui)
[08:17:44] <wikibugs>	 (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1091" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444793 (owner: 10Marostegui)
[08:17:46] <wikibugs>	 (03CR) 10jenkins-bot: Stop loading the MwEmbedSupport extension, part I [mediawiki-config] - 10https://gerrit.wikimedia.org/r/441519 (owner: 10Jforrester)
[08:17:48] <wikibugs>	 (03CR) 10jenkins-bot: Remove unnecessary code: $wgTidyConfig can never be null [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444650 (owner: 10C. Scott Ananian)
[08:17:49] <icinga-wm>	 RECOVERY - Check systemd state on restbase2001 is OK: OK - running: The system is fully operational
[08:17:50] <wikibugs>	 (03CR) 10jenkins-bot: Stop loading the MwEmbedSupport extension, part II [mediawiki-config] - 10https://gerrit.wikimedia.org/r/441518 (owner: 10Jforrester)
[08:17:52] <wikibugs>	 (03CR) 10jenkins-bot: Stop loading the MwEmbedSupport extension, part III [mediawiki-config] - 10https://gerrit.wikimedia.org/r/441520 (owner: 10Jforrester)
[08:17:54] <wikibugs>	 (03CR) 10jenkins-bot: Stop loading the MwEmbedSupport extension, part IV [mediawiki-config] - 10https://gerrit.wikimedia.org/r/441521 (owner: 10Jforrester)
[08:19:58] <joal>	 elukey: Success !
[08:20:06] <joal>	 elukey: we can continue to rollout
[08:20:48] <elukey>	 joal: I'll also add ip6 addresses too
[08:21:54] <wikibugs>	 (03PS1) 10Elukey: Add interface::add_ip6_mapped to aqs nodes [puppet] - 10https://gerrit.wikimedia.org/r/444811 (https://phabricator.wikimedia.org/T199180)
[08:22:41] <wikibugs>	 (03PS1) 10Muehlenhoff: Add library hint for libjpeg-turbo [puppet] - 10https://gerrit.wikimedia.org/r/444812
[08:24:59] <icinga-wm>	 PROBLEM - Check systemd state on restbase2010 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[08:25:03] <wikibugs>	 (03CR) 10Elukey: [C: 032] Add interface::add_ip6_mapped to aqs nodes [puppet] - 10https://gerrit.wikimedia.org/r/444811 (https://phabricator.wikimedia.org/T199180) (owner: 10Elukey)
[08:26:07] <wikibugs>	 (03PS2) 10Muehlenhoff: Add library hint for libjpeg-turbo [puppet] - 10https://gerrit.wikimedia.org/r/444812
[08:26:56] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 032] Add library hint for libjpeg-turbo [puppet] - 10https://gerrit.wikimedia.org/r/444812 (owner: 10Muehlenhoff)
[08:32:23] <wikibugs>	 (03CR) 10Jcrespo: [C: 032] wmf_root_clients: Productionize mysql.py wrapper [puppet] - 10https://gerrit.wikimedia.org/r/444808 (owner: 10Jcrespo)
[08:32:31] <wikibugs>	 (03PS3) 10Jcrespo: wmf_root_clients: Productionize mysql.py wrapper [puppet] - 10https://gerrit.wikimedia.org/r/444808
[08:34:44] <elukey>	 !log rolling restart of AQS to apply the new config
[08:34:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:35:59] <icinga-wm>	 RECOVERY - Check systemd state on restbase2010 is OK: OK - running: The system is fully operational
[08:37:24] <godog>	 !log drain and restart cassandra-a on restbase2001 to test a restart
[08:37:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:38:35] <moritzm>	 !log installing libjpeg-turbo security updates on trusty
[08:38:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:40:10] <icinga-wm>	 PROBLEM - cassandra-a SSL 10.192.16.162:7001 on restbase2001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[08:40:19] <icinga-wm>	 PROBLEM - cassandra-a CQL 10.192.16.162:9042 on restbase2001 is CRITICAL: connect to address 10.192.16.162 and port 9042: Connection refused
[08:40:52] <wikibugs>	 (03CR) 10Jcrespo: [C: 032] Remove .hosts files, update tendril instead [software] - 10https://gerrit.wikimedia.org/r/444798 (owner: 10Jcrespo)
[08:42:55] <godog>	 expected ^
[08:45:06] <godog>	 should be recovering shortly
[08:45:13] <mobrovac>	 cool
[08:45:27] <mobrovac>	 godog: i will also restart rb on that node to check
[08:46:08] <godog>	 mobrovac: ok! I skipped rb because iirc the change didn't affect anything about it
[08:46:24] <mobrovac>	 true, but let's be sure :)
[08:46:59] <icinga-wm>	 RECOVERY - cassandra-a SSL 10.192.16.162:7001 on restbase2001 is OK: SSL OK - Certificate restbase2001-a valid until 2020-06-24 13:01:24 +0000 (expires in 715 days)
[08:47:00] <icinga-wm>	 RECOVERY - cassandra-a CQL 10.192.16.162:9042 on restbase2001 is OK: TCP OK - 0.036 second response time on 10.192.16.162 port 9042
[08:47:33] <mobrovac>	 ok all good for rb as well
[08:51:06] <godog>	 \o/
[08:53:00] <icinga-wm>	 PROBLEM - Check systemd state on restbase-dev1005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[08:53:59] <wikibugs>	 (03PS1) 10Muehlenhoff: Add library hints for libsoup2.4 [puppet] - 10https://gerrit.wikimedia.org/r/444814
[08:58:07] <mobrovac>	 checking ^
[08:59:00] <wikibugs>	 (03PS2) 10Muehlenhoff: Add library hints for libsoup2.4 [puppet] - 10https://gerrit.wikimedia.org/r/444814
[08:59:54] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 032] Add library hints for libsoup2.4 [puppet] - 10https://gerrit.wikimedia.org/r/444814 (owner: 10Muehlenhoff)
[08:59:55] <godog>	 mobrovac: ah yeah that's because of a missing 'systemctl reset-failed cassandra-metrics-collector' I believe
[09:00:03] <godog>	 I'll run it on the rest of the dev cluster
[09:00:10] <mobrovac>	 ah ok :)
[09:00:23] <mobrovac>	 indeed it is
[09:00:33] <mobrovac>	 cassandra-metrics-collector.service not-found failed failed
[09:00:49] <icinga-wm>	 RECOVERY - Check systemd state on restbase-dev1005 is OK: OK - running: The system is fully operational
[09:01:11] <wikibugs>	 10Operations, 10ops-eqiad: Relabel labvirt1021.eqiad.wmnet as cloudvirt1021.eqiad.wmnet - https://phabricator.wikimedia.org/T199132 (10aborrero)
[09:02:30] <wikibugs>	 10Operations, 10Pybal, 10Traffic, 10Patch-For-Review: Tune systemd journal rate limiting for PyBal - https://phabricator.wikimedia.org/T189290 (10ema) >>! In T189290#4053740, @ema wrote: > It would have been much more useful to get such messages into `journalctl -u pybal.service`'s output instead, and I do...
[09:05:04] <moritzm>	 !log installing libsoup security updates on jessie/stretch
[09:05:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:06:39] <icinga-wm>	 PROBLEM - Check systemd state on restbase-dev1006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[09:08:59] <icinga-wm>	 PROBLEM - Check systemd state on restbase-dev1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[09:09:45] <wikibugs>	 (03PS1) 10Elukey: Add interface::add_ip6_mapped to druid, bohrium, thorium and meinerium [puppet] - 10https://gerrit.wikimedia.org/r/444819 (https://phabricator.wikimedia.org/T199180)
[09:11:03] <wikibugs>	 10Operations, 10Analytics, 10Analytics-Kanban, 10netops, 10Patch-For-Review: Review analytics-in4 rules on cr1/cr2 eqiad - https://phabricator.wikimedia.org/T198623 (10elukey) Last changes applied by Arzhel, including merging common-infrastructure4 to analytics-in4
[09:11:27] <wikibugs>	 (03PS4) 10Arturo Borrero Gonzalez: cloudvps: rename labvirt1021.eqiad.wmnet to cloudvirt1021.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/444584 (https://phabricator.wikimedia.org/T199107)
[09:11:29] <wikibugs>	 (03CR) 10Elukey: [C: 032] Add interface::add_ip6_mapped to druid, bohrium, thorium and meinerium [puppet] - 10https://gerrit.wikimedia.org/r/444819 (https://phabricator.wikimedia.org/T199180) (owner: 10Elukey)
[09:12:19] <icinga-wm>	 RECOVERY - Check systemd state on restbase-dev1004 is OK: OK - running: The system is fully operational
[09:13:19] <wikibugs>	 (03PS4) 10Arturo Borrero Gonzalez: cloudvps: reimage and rename labvirt1021 to cloudvirt1021 [puppet] - 10https://gerrit.wikimedia.org/r/444581 (https://phabricator.wikimedia.org/T199107)
[09:13:54] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 032] cloudvps: rename labvirt1021.eqiad.wmnet to cloudvirt1021.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/444584 (https://phabricator.wikimedia.org/T199107) (owner: 10Arturo Borrero Gonzalez)
[09:14:26] <wikibugs>	 10Operations, 10ops-esams, 10Epic: Remove all decommissioned hardware - https://phabricator.wikimedia.org/T184063 (10mark)
[09:14:35] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 032] cloudvps: reimage and rename labvirt1021 to cloudvirt1021 [puppet] - 10https://gerrit.wikimedia.org/r/444581 (https://phabricator.wikimedia.org/T199107) (owner: 10Arturo Borrero Gonzalez)
[09:17:42] <wikibugs>	 (03PS3) 10ArielGlenn: quick script to show runtimes of dump jobs [dumps] - 10https://gerrit.wikimedia.org/r/444603 (https://phabricator.wikimedia.org/T199117)
[09:25:28] <godog>	 !log swift eqiad-prod add ms-be1036 back gradually - T196873
[09:25:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:25:32] <stashbot>	 T196873: ms-be1036 in power off status, not responsive to power on commands - https://phabricator.wikimedia.org/T196873
[09:27:10] <icinga-wm>	 PROBLEM - Check systemd state on stat1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[09:28:33] <elukey>	 working on --^
[09:29:03] <elukey>	 there seems to be some nagios check_disk processes in interruptible sleep, I think probably due to nfs or something similar
[09:30:00] <mobrovac>	 godog: applied everywhere?
[09:31:25] <godog>	 mobrovac: yup should be rolled out everywhere now
[09:33:26] <icinga-wm>	 RECOVERY - Check systemd state on stat1004 is OK: OK - running: The system is fully operational
[09:33:30] <mobrovac>	 kk thnx!
[09:34:36] <elukey>	 !log forced umount of /mnt/hdfs on stat1004, several processes hang for it (causing load) and transport not connected
[09:34:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:38:49] <wikibugs>	 (03PS1) 10Elukey: profile::analytics::cluster::client: enable by default /mnt/hdfs check [puppet] - 10https://gerrit.wikimedia.org/r/444825
[09:40:08] <wikibugs>	 (03PS2) 10Elukey: profile::analytics::cluster::client: enable by default /mnt/hdfs check [puppet] - 10https://gerrit.wikimedia.org/r/444825
[09:41:22] <wikibugs>	 (03CR) 10Elukey: [C: 032] profile::analytics::cluster::client: enable by default /mnt/hdfs check [puppet] - 10https://gerrit.wikimedia.org/r/444825 (owner: 10Elukey)
[09:42:26] <icinga-wm>	 PROBLEM - swift-object-server on ms-be1040 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server
[09:42:26] <icinga-wm>	 PROBLEM - swift-account-server on ms-be1040 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server
[09:42:26] <icinga-wm>	 PROBLEM - swift-container-server on ms-be1040 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server
[09:42:35] <icinga-wm>	 PROBLEM - swift-account-auditor on ms-be1040 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-auditor
[09:42:35] <icinga-wm>	 PROBLEM - swift-object-auditor on ms-be1040 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor
[09:42:45] <icinga-wm>	 PROBLEM - swift-container-updater on ms-be1040 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-updater
[09:42:46] <wikibugs>	 (03PS1) 10Ema: varnish: improve wm_common_directors_init readability [puppet] - 10https://gerrit.wikimedia.org/r/444827
[09:42:46] <icinga-wm>	 PROBLEM - swift-container-replicator on ms-be1040 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-replicator
[09:42:55] <icinga-wm>	 PROBLEM - swift-object-replicator on ms-be1040 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-replicator
[09:42:56] <icinga-wm>	 PROBLEM - swift-account-reaper on ms-be1040 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-reaper
[09:42:56] <icinga-wm>	 PROBLEM - swift-container-auditor on ms-be1040 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[09:43:05] <icinga-wm>	 PROBLEM - swift-object-updater on ms-be1040 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-updater
[09:43:05] <icinga-wm>	 PROBLEM - swift-account-replicator on ms-be1040 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-replicator
[09:43:49] <godog>	 oops, that's me
[09:43:52] <godog>	 sorry about the spam
[09:44:37] <wikibugs>	 (03PS1) 10Muehlenhoff: Add library hint for exiv2 [puppet] - 10https://gerrit.wikimedia.org/r/444828
[09:45:38] <wikibugs>	 (03PS2) 10Muehlenhoff: Add library hint for exiv2 [puppet] - 10https://gerrit.wikimedia.org/r/444828
[09:48:23] <volans>	 moritzm: I was wondering if we could use apt-cache depends in debdeploy to detect the related libraries, which would let us drop a lot of (if not all) the library hints
[09:49:59] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 032] Add library hint for exiv2 [puppet] - 10https://gerrit.wikimedia.org/r/444828 (owner: 10Muehlenhoff)
[09:51:14] <wikibugs>	 10Operations, 10media-storage: Some swift filesystems reporting negative disk usage - https://phabricator.wikimedia.org/T199198 (10fgiunchedi)
[09:51:15] <moritzm>	 volans: the query_deps command uses apt-cache rdepends, but for detecting restarts we need the specific sonames. ideally this would be part of dpkg meta data, I want to propose support for this upstream, but needs more work/thought
[09:51:31] <wikibugs>	 10Operations, 10media-storage, 10User-fgiunchedi: Some swift filesystems reporting negative disk usage - https://phabricator.wikimedia.org/T199198 (10fgiunchedi)
[09:52:17] <moritzm>	 and when there's a public release of debdeploy I'll simply ship it in the package as a starting point for others (most updates already reuse existing hints by now anyway)
[09:52:26] <volans>	 moritzm: yeah, what I'm referring to is the 'exiv2 = libexiv2', that could be gathered by depends AFAIK, so for the simple cases it should work
[09:52:34] <moritzm>	 no, that'
[09:52:36] <icinga-wm>	 RECOVERY - swift-account-reaper on ms-be1040 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper
[09:52:36] <icinga-wm>	 RECOVERY - swift-container-auditor on ms-be1040 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[09:52:45] <icinga-wm>	 RECOVERY - swift-object-updater on ms-be1040 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater
[09:52:45] <icinga-wm>	 RECOVERY - swift-account-replicator on ms-be1040 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator
[09:52:55] <icinga-wm>	 PROBLEM - puppet last run on druid1003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/rsyslog.d/10-puppet-agent.conf]
[09:53:06] <moritzm>	 no, that's not the package name, it's the soname of the library which isn't necessarily identical. for src:exiv2 it is, but it's not a reliable heuristic
[09:53:06] <icinga-wm>	 RECOVERY - swift-object-server on ms-be1040 is OK: PROCS OK: 101 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server
[09:53:06] <wikibugs>	 (03PS2) 10Ema: varnish: improve wm_common_directors_init readability [puppet] - 10https://gerrit.wikimedia.org/r/444827
[09:53:15] <icinga-wm>	 RECOVERY - swift-account-server on ms-be1040 is OK: PROCS OK: 49 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server
[09:53:15] <icinga-wm>	 RECOVERY - swift-container-server on ms-be1040 is OK: PROCS OK: 49 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server
[09:53:16] <icinga-wm>	 RECOVERY - swift-account-auditor on ms-be1040 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor
[09:53:16] <icinga-wm>	 RECOVERY - swift-object-auditor on ms-be1040 is OK: PROCS OK: 3 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor
[09:53:26] <icinga-wm>	 RECOVERY - swift-container-updater on ms-be1040 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater
[09:53:35] <icinga-wm>	 RECOVERY - swift-container-replicator on ms-be1040 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator
[09:53:36] <icinga-wm>	 RECOVERY - swift-object-replicator on ms-be1040 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator
[09:56:46] <icinga-wm>	 PROBLEM - puppet last run on druid1005 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/rsyslog.d/10-puppet-agent.conf]
[10:00:56] <volans>	 moritzm: sure, my thought was: maybe adding that heuristic in addition helps to reduce the list of manual hints required, not that it will solve all of them
[10:02:55] <icinga-wm>	 RECOVERY - puppet last run on druid1003 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures
[10:03:47] <elukey>	 !log restart analytics100[1,2]'s hadoop resource managers, some I/O socket errors after the ip6 interface change
[10:03:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:03:59] <moritzm>	 wouldn't really help, we really want reliable data here. ideally a future dpkg will provide the meta data, but until then it's an okay interim solution
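The point about name-based guessing can be made concrete with a toy comparison. The sketch below is not debdeploy code and not what `apt-cache depends` returns. It uses a deliberately naive guess ("lib" plus the source package name) as a stand-in for any name-derived heuristic. The `exiv2` entry mirrors the `exiv2 = libexiv2` hint quoted above; the `openssl` and `glibc` entries are illustrative assumptions added here and are not exhaustive.

```python
# Illustrative data only: source package -> some library stems it ships.
# Only the exiv2 entry comes from the discussion; the others are assumed
# examples added for illustration.
KNOWN_LIBRARY_STEMS = {
    "exiv2": {"libexiv2"},
    "openssl": {"libssl", "libcrypto"},
    "glibc": {"libc"},
}


def guess_stem(source_package: str) -> str:
    """Naive heuristic: assume the library is named 'lib' + source package."""
    if source_package.startswith("lib"):
        return source_package
    return "lib" + source_package


def heuristic_misses(known: dict[str, set[str]]) -> dict[str, set[str]]:
    """Return, per source package, the real stems the heuristic fails to guess."""
    misses = {}
    for source, stems in known.items():
        missing = stems - {guess_stem(source)}
        if missing:
            misses[source] = missing
    return misses


if __name__ == "__main__":
    for source, missing in sorted(heuristic_misses(KNOWN_LIBRARY_STEMS).items()):
        print(f"{source}: heuristic would miss {sorted(missing)}")
```

The guess is right for `exiv2` and wrong for the other two, which is the sense in which it works for simple cases without being reliable enough to decide which services need a restart.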
[10:06:45] <icinga-wm>	 RECOVERY - puppet last run on druid1005 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[10:07:18] <wikibugs>	 (03PS3) 10Ema: varnish: improve directors definition readability [puppet] - 10https://gerrit.wikimedia.org/r/444827
[10:07:20] <moritzm>	 !log restarting thumbor on thumbor1001 to pick up exiv2 security updates
[10:07:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:07:46] <logmsgbot>	 !log ppchelko@deploy1001 Started deploy [restbase/deploy@d724ad1]: Fix up the invalid Vary header
[10:07:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:09:35] <icinga-wm>	 ACKNOWLEDGEMENT - Memory correctable errors -EDAC- on cp1053 is CRITICAL: 225 ge 4 Ema The host is depooled -- T165252 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=cp1053&var-datasource=eqiad%2520prometheus%252Fops
[10:15:23] <wikibugs>	 (03CR) 10Ema: [C: 031] Remove cp3048 prod DNS entries [dns] - 10https://gerrit.wikimedia.org/r/444560 (https://phabricator.wikimedia.org/T190607) (owner: 10Muehlenhoff)
[10:20:36] <logmsgbot>	 !log ppchelko@deploy1001 deploy aborted: Fix up the invalid Vary header (duration: 12m 50s)
[10:20:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:20:51] <logmsgbot>	 !log ppchelko@deploy1001 Started deploy [restbase/deploy@d724ad1]: Fix up the invalid Vary header, take 2, checker timed out
[10:20:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:22:27] <edsanders>	 The main VisualEditor workboard is 404ing: https://phabricator.wikimedia.org/project/board/483/
[10:22:38] <edsanders>	 our other boards appear to be fine, and it's appearing in search
[10:24:30] <wikibugs>	 (03PS1) 10Mobrovac: RESTBase: Add Proton's URI [puppet] - 10https://gerrit.wikimedia.org/r/444835 (https://phabricator.wikimedia.org/T186748)
[10:27:30] <logmsgbot>	 !log ppchelko@deploy1001 Finished deploy [restbase/deploy@d724ad1]: Fix up the invalid Vary header, take 2, checker timed out (duration: 06m 39s)
[10:27:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:28:05] <wikibugs>	 (03CR) 10Mobrovac: "PCC ok - https://puppet-compiler.wmflabs.org/compiler02/11752/" [puppet] - 10https://gerrit.wikimedia.org/r/444835 (https://phabricator.wikimedia.org/T186748) (owner: 10Mobrovac)
[10:36:33] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: install_server: partman: refresh labvirt-ssd partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/444837
[10:37:15] <wikibugs>	 (03PS2) 10Arturo Borrero Gonzalez: install_server: partman: refresh labvirt-ssd partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/444837
[10:38:14] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 031] varnish: improve directors definition readability [puppet] - 10https://gerrit.wikimedia.org/r/444827 (owner: 10Ema)
[10:41:56] <wikibugs>	 (03CR) 10Ema: [C: 032] varnish: improve directors definition readability [puppet] - 10https://gerrit.wikimedia.org/r/444827 (owner: 10Ema)
[10:42:15] <wikibugs>	 (03PS1) 10ArielGlenn: make stub dumps on a couple more wikis ordered by revisions within pages [puppet] - 10https://gerrit.wikimedia.org/r/444838 (https://phabricator.wikimedia.org/T29112)
[10:50:57] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1232 is CRITICAL: CRITICAL - load average: 56.24, 33.31, 22.32
[10:51:18] <wikibugs>	 (03PS2) 10ArielGlenn: make stub dumps on a couple more wikis ordered by revisions within pages [puppet] - 10https://gerrit.wikimedia.org/r/444838 (https://phabricator.wikimedia.org/T29112)
[10:52:37] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1282 is CRITICAL: CRITICAL - load average: 65.83, 36.97, 23.31
[10:52:37] <wikibugs>	 (03CR) 10ArielGlenn: [C: 032] make stub dumps on a couple more wikis ordered by revisions within pages [puppet] - 10https://gerrit.wikimedia.org/r/444838 (https://phabricator.wikimedia.org/T29112) (owner: 10ArielGlenn)
[10:53:04] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: apache-fast-test: exit with non-zero exit codes if some issues are found [puppet] - 10https://gerrit.wikimedia.org/r/444839
[10:53:23] <wikibugs>	 (03PS2) 10Muehlenhoff: Remove cp3048 prod DNS entries [dns] - 10https://gerrit.wikimedia.org/r/444560 (https://phabricator.wikimedia.org/T190607)
[10:55:09] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 032] Remove cp3048 prod DNS entries [dns] - 10https://gerrit.wikimedia.org/r/444560 (https://phabricator.wikimedia.org/T190607) (owner: 10Muehlenhoff)
[10:56:14] <wikibugs>	 10Operations, 10ops-esams, 10Traffic, 10Patch-For-Review: cp3048 hardware issues - https://phabricator.wikimedia.org/T190607 (10MoritzMuehlenhoff)
[10:58:15] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1226 is CRITICAL: CRITICAL - load average: 49.00, 33.59, 23.83
[10:58:30] <wikibugs>	 10Operations, 10Analytics, 10hardware-requests: eqiad: (2) hardware refresh for analytics1003 - https://phabricator.wikimedia.org/T198685 (10MoritzMuehlenhoff) p:05Triage>03Normal
[10:58:36] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1276 is CRITICAL: CRITICAL - load average: 61.06, 41.88, 27.14
[10:59:08] <wikibugs>	 10Operations, 10Traffic: Setup wikimediafoundation.org domain for July 30 launch of new site - https://phabricator.wikimedia.org/T198922 (10MoritzMuehlenhoff)
[10:59:21] <wikibugs>	 10Operations, 10ops-esams: Degraded RAID on cp3048 - https://phabricator.wikimedia.org/T198784 (10MoritzMuehlenhoff) p:05Triage>03Normal
[10:59:46] <wikibugs>	 10Operations, 10procurement: leases expiring on labvirt1010 and 1011 - https://phabricator.wikimedia.org/T198762 (10MoritzMuehlenhoff) p:05Triage>03Normal
[10:59:56] <wikibugs>	 (03PS3) 10Arturo Borrero Gonzalez: openstack: bootstrap: neutron: refresh and add more hints [puppet] - 10https://gerrit.wikimedia.org/r/444222 (https://phabricator.wikimedia.org/T196633)
[11:00:04] <jouncebot>	 addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: My dear minions, it's time we take the moon! Just kidding. Time for European Mid-day SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180710T1100).
[11:00:04] <jouncebot>	 dcausse and Aaron Schulz: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[11:00:40] <zeljkof>	 dcausse, AaronSchulz: want to deploy your own patches?
[11:01:26] <dcausse>	 o/
[11:01:31] <dcausse>	 zeljkof: sure
[11:01:39] <AaronSchulz>	 I suppose
[11:02:02] <zeljkof>	 dcausse, AaronSchulz: swat is yours then :) 
[11:02:13] <zeljkof>	 self-organize and go ahead
[11:02:26] <dcausse>	 ok starting to deploy mine (it should be quick, it's just a cleanup)
[11:02:53] <hashar>	 I am around if needed
[11:02:55] <wikibugs>	 (03CR) 10DCausse: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444576 (owner: 10DCausse)
[11:03:27] <wikibugs>	 10Operations, 10ops-eqiad: Relabel labvirt1021.eqiad.wmnet as cloudvirt1021.eqiad.wmnet - https://phabricator.wikimedia.org/T199132 (10aborrero) The server was reimaged+renamed and I just upgraded the racktables record.
[11:03:36] <icinga-wm>	 PROBLEM - SSH on ms-be1034 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:03:43] <wikibugs>	 10Operations, 10media-storage, 10User-fgiunchedi: Some swift filesystems reporting negative disk usage - https://phabricator.wikimedia.org/T199198 (10fgiunchedi) Other differences with existing machines include the fact that the last batch has been installed with stretch from the get go as opposed to jessie,...
[11:04:26] <icinga-wm>	 RECOVERY - SSH on ms-be1034 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u2 (protocol 2.0)
[11:04:39] <wikibugs>	 (03Merged) 10jenkins-bot: [cirrus] cleanup unused config vars 1/2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444576 (owner: 10DCausse)
[11:05:54] <wikibugs>	 (03PS1) 10Ema: varnish: avoid adding vtc_backend multiple times [puppet] - 10https://gerrit.wikimedia.org/r/444840
[11:06:08] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: apache-fast-test: exit with non-zero exit codes if some issues are found [puppet] - 10https://gerrit.wikimedia.org/r/444839
[11:06:13] <wikibugs>	 (03PS1) 10ArielGlenn: one more wiki with order by revs for stubs dump [puppet] - 10https://gerrit.wikimedia.org/r/444841 (https://phabricator.wikimedia.org/T29112)
[11:07:30] <wikibugs>	 (03CR) 10ArielGlenn: [C: 032] one more wiki with order by revs for stubs dump [puppet] - 10https://gerrit.wikimedia.org/r/444841 (https://phabricator.wikimedia.org/T29112) (owner: 10ArielGlenn)
[11:07:42] <wikibugs>	 10Operations, 10Traffic, 10Wikidata, 10wikiba.se, 10Patch-For-Review: [Task] move wikiba.se webhosting to wikimedia misc-cluster - https://phabricator.wikimedia.org/T99531 (10abian) wikiba.se is a bit unstable. Today has been down for some hours (from ~1:00 UTC to ~5:30 UTC). Last issues were detected on...
[11:08:13] <wikibugs>	 (03CR) 10jenkins-bot: [cirrus] cleanup unused config vars 1/2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444576 (owner: 10DCausse)
[11:09:26] <logmsgbot>	 !log dcausse@deploy1001 Synchronized ./wmf-config/: [cirrus] cleanup unused config vars 1/2 (duration: 00m 53s)
[11:09:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:10:38] <wikibugs>	 (03CR) 10DCausse: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444577 (owner: 10DCausse)
[11:11:23] <wikibugs>	 (03CR) 10Ema: [C: 032] "pcc output looks good: https://puppet-compiler.wmflabs.org/compiler02/11753/" [puppet] - 10https://gerrit.wikimedia.org/r/444840 (owner: 10Ema)
[11:11:31] <wikibugs>	 (03PS2) 10Ema: varnish: avoid adding vtc_backend multiple times [puppet] - 10https://gerrit.wikimedia.org/r/444840
[11:11:46] <wikibugs>	 (03PS1) 10ArielGlenn: fix typo [puppet] - 10https://gerrit.wikimedia.org/r/444850
[11:12:36] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1226 is OK: OK - load average: 8.22, 18.40, 23.95
[11:13:21] <wikibugs>	 10Operations, 10ops-eqiad: Relabel labvirt1022.eqiad.wmnet as cloudvirt1022.eqiad.wmnet - https://phabricator.wikimedia.org/T199203 (10aborrero)
[11:16:25] <dcausse>	 hashar: should I wait or rebase (https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/444577/) I don't get the status of this patch here
[11:16:50] <wikibugs>	 (03PS14) 10Ema: cache_text: add support for alternate_domains [puppet] - 10https://gerrit.wikimedia.org/r/443906 (https://phabricator.wikimedia.org/T164609)
[11:16:52] <wikibugs>	 (03PS14) 10Ema: cache_text: add misc directors and alternate_domains [puppet] - 10https://gerrit.wikimedia.org/r/443907 (https://phabricator.wikimedia.org/T164609)
[11:16:54] <wikibugs>	 (03PS9) 10Ema: cache_text: load misc VCL as wikimedia_misc in VTC files [puppet] - 10https://gerrit.wikimedia.org/r/443930 (https://phabricator.wikimedia.org/T164609)
[11:16:56] <wikibugs>	 (03PS7) 10Ema: cache_text: add misc-specific VTC tests [puppet] - 10https://gerrit.wikimedia.org/r/443974 (https://phabricator.wikimedia.org/T164609)
[11:17:25] <wikibugs>	 (03CR) 10TerraCodes: "Since I3773876fa7aa9205a5ea98cbbbdecaef9c06ff81 is deployed, should this patch be abandoned?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420948 (owner: 10Niharika29)
[11:18:40] <wikibugs>	 (03PS3) 10DCausse: [cirrus] cleanup unused config vars 2/2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444577
[11:18:43] <dcausse>	 hashar: with the new gerrit ui it's not clear that it needs a rebase (with the old one I see "Cannot merge")
[11:19:20] <wikibugs>	 (03CR) 10DCausse: [cirrus] cleanup unused config vars 2/2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444577 (owner: 10DCausse)
[11:20:43] <wikibugs>	 (03CR) 10DCausse: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444577 (owner: 10DCausse)
[11:21:25] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1232 is OK: OK - load average: 13.71, 16.06, 23.76
[11:21:55] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1276 is OK: OK - load average: 14.06, 21.31, 29.61
[11:22:27] <wikibugs>	 (03Merged) 10jenkins-bot: [cirrus] cleanup unused config vars 2/2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444577 (owner: 10DCausse)
[11:22:39] <wikibugs>	 (03CR) 10jenkins-bot: [cirrus] cleanup unused config vars 2/2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444577 (owner: 10DCausse)
[11:24:05] <wikibugs>	 (03PS3) 10Aaron Schulz: Make all non-test wikis write to both nutcracker and mcrouter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440470 (https://phabricator.wikimedia.org/T198239)
[11:24:57] <moritzm>	 !log installing PHP 7 security updates on stretch
[11:24:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:25:06] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: openstack: eqiad1: enable nova-compute service [puppet] - 10https://gerrit.wikimedia.org/r/444855 (https://phabricator.wikimedia.org/T196633)
[11:25:59] <logmsgbot>	 !log dcausse@deploy1001 Synchronized ./wmf-config/InitialiseSettings.php: [cirrus] cleanup unused config vars 2/2 (duration: 01m 40s)
[11:26:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:26:31] <dcausse>	 AaronSchulz: I'm done, please go ahead
[11:26:49] <AaronSchulz>	 ok
[11:27:00] <wikibugs>	 (03CR) 10Aaron Schulz: [C: 032] Make all non-test wikis write to both nutcracker and mcrouter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440470 (https://phabricator.wikimedia.org/T198239) (owner: 10Aaron Schulz)
[11:28:35] <wikibugs>	 (03Merged) 10jenkins-bot: Make all non-test wikis write to both nutcracker and mcrouter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440470 (https://phabricator.wikimedia.org/T198239) (owner: 10Aaron Schulz)
[11:32:14] <wikibugs>	 (03CR) 10jenkins-bot: Make all non-test wikis write to both nutcracker and mcrouter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440470 (https://phabricator.wikimedia.org/T198239) (owner: 10Aaron Schulz)
[11:32:50] <logmsgbot>	 !log aaron@deploy1001 Synchronized wmf-config/mc.php: Make all non-test wikis write to both nutcracker and mcrouter (duration: 00m 51s)
[11:32:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:36:25] <Raymond_>	 For the past few minutes I have been getting fatal errors of type "Exception" on Wikimedia Commons while saving
[11:36:27] <Raymond_>	 [W0SZmgpAAD4AAFHXWLcAAACR] 2018-07-10 11:33:46: Fatal exception of type "Exception"
[11:38:35] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1282 is OK: OK - load average: 13.75, 15.22, 29.89
[11:39:56] <icinga-wm>	 PROBLEM - DPKG on mw2250 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[11:40:05] <AaronSchulz>	 hashar: https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/SpamBlacklist/+/444856/
[11:42:15] <icinga-wm>	 RECOVERY - DPKG on mw2250 is OK: All packages OK
[11:43:06] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 90.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[11:43:35] <arturo>	 !log updated compiler facts `PUPPET_MASTERS=puppetmaster1001.eqiad.wmnet PUPPET_COMPILER=compiler02.puppet3-diffs.eqiad.wmflabs modules/puppet_compiler/files/compiler-update-facts`
[11:43:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:49:20] <wikibugs>	 (03PS2) 10Arturo Borrero Gonzalez: openstack: eqiad1: enable nova-compute service [puppet] - 10https://gerrit.wikimedia.org/r/444855 (https://phabricator.wikimedia.org/T196633)
[11:50:30] <addshore>	 zeljkof: happy training today :D
[11:50:54] <zeljkof>	 addshore: thanks :P
[11:51:01] <zeljkof>	 enjoying it
[11:53:16] <logmsgbot>	 !log ppchelko@deploy1001 Started deploy [restbase/deploy@2674971]: Roll out cassandra-driver@3.5.0 to restbase2001
[11:53:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:53:20] <addshore>	 AaronSchulz: is that change going to be backported or the other patch reverted?
[11:53:55] <icinga-wm>	 PROBLEM - Memcached on labtestweb2001 is CRITICAL: connect to address 208.80.153.14 and port 11000: Connection refused
[11:54:13] <AaronSchulz>	 addshore: backported
[11:54:19] <addshore>	 ack :)
[11:54:38] <AaronSchulz>	 addshore: want to CR for master? 
[11:55:18] <addshore>	 can do
[11:56:09] <logmsgbot>	 !log ppchelko@deploy1001 Finished deploy [restbase/deploy@2674971]: Roll out cassandra-driver@3.5.0 to restbase2001 (duration: 02m 52s)
[11:56:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:58:05] <Reedy>	 Why did that suddenly break?
[11:58:16] <addshore>	 AaronSchulz: +2ed on master, also https://phabricator.wikimedia.org/T199216 if you want to use the bug number in SAL entries :)
[11:58:24] <wikibugs>	 (03PS3) 10Arturo Borrero Gonzalez: openstack: eqiad1: enable nova-compute service [puppet] - 10https://gerrit.wikimedia.org/r/444855 (https://phabricator.wikimedia.org/T196633)
[11:59:11] <addshore>	 11:32 aaron@deploy1001: Synchronized wmf-config/mc.php: Make all non-test wikis write to both nutcracker and mcrouter (duration: 00m 51s)
[11:59:13] <addshore>	 Reedy: ^^
[11:59:24] <Reedy>	 lol
[11:59:57] <Reedy>	 https://phabricator.wikimedia.org/T199039
[12:00:04] <Reedy>	 It's not the only on
[12:00:04] <jouncebot>	 Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180710T1200)
[12:00:06] <Reedy>	 one
[12:00:10] <Reedy>	 Translate needs fixing it seems
[12:00:18] <AaronSchulz>	 addshore: that didn't use to be an exception back when, just a warning
[12:00:50] <addshore>	 might be better to just revert
[12:00:58] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 032] "https://puppet-compiler.wmflabs.org/compiler02/11759/ compiler is good" [puppet] - 10https://gerrit.wikimedia.org/r/444855 (https://phabricator.wikimedia.org/T196633) (owner: 10Arturo Borrero Gonzalez)
[12:01:19] <AaronSchulz>	 addshore: maybe, waiting on jenkins is a tad slow
[12:02:37] <wikibugs>	 (03PS1) 10Aaron Schulz: Revert "Make all non-test wikis write to both nutcracker and mcrouter" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444862
[12:02:37] <wikibugs>	 (03CR) 10Aaron Schulz: [C: 032] Revert "Make all non-test wikis write to both nutcracker and mcrouter" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444862 (owner: 10Aaron Schulz)
[12:04:17] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "Make all non-test wikis write to both nutcracker and mcrouter" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444862 (owner: 10Aaron Schulz)
[12:05:46] <icinga-wm>	 RECOVERY - Memcached on labtestweb2001 is OK: TCP OK - 0.036 second response time on 208.80.153.14 port 11000
[12:05:46] <icinga-wm>	 PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on einsteinium is CRITICAL: cluster=cache_text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[12:05:50] <Reedy>	 AaronSchulz: addshore there are at least two other bugs like this already reported that haven't been fixed...
[12:06:39] <logmsgbot>	 !log aaron@deploy1001 Synchronized wmf-config/mc.php: Revert "Make all non-test wikis write to both nutcracker and mcrouter" (duration: 00m 56s)
[12:06:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:08:09] <Reedy>	 https://gerrit.wikimedia.org/r/444863 / https://phabricator.wikimedia.org/T199039
[12:08:15] <Reedy>	 And also https://phabricator.wikimedia.org/T199218
[12:08:54] <wikibugs>	 (03CR) 10jenkins-bot: Revert "Make all non-test wikis write to both nutcracker and mcrouter" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444862 (owner: 10Aaron Schulz)
[12:10:15] <icinga-wm>	 RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[12:11:18] <AaronSchulz>	 hmm, the RemoteFileDescription was already fixed in master it seems
[12:11:25] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[12:20:36] <logmsgbot>	 !log aaron@deploy1001 Synchronized php-1.32.0-wmf.12/extensions/SpamBlacklist/includes/SpamBlacklist.php: 583dc7a92f9b (duration: 00m 51s)
[12:20:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:22:01] <wikibugs>	 10Operations, 10MediaWiki-Configuration: Cleanup cirrus keys in $wmfSwiftEqiadConfig - https://phabricator.wikimedia.org/T199220 (10dcausse)
[12:23:36] <icinga-wm>	 PROBLEM - puppet last run on mw1347 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[set debconf flag seen for wireshark-common/install-setuid]
[12:24:31] <logmsgbot>	 !log aaron@deploy1001 Synchronized php-1.32.0-wmf.10/extensions/SpamBlacklist/includes/SpamBlacklist.php: 08a2153f7aa (duration: 00m 51s)
[12:24:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:24:48] <wikibugs>	 10Operations, 10MediaWiki-Configuration: Cleanup cirrus keys in $wmfSwiftEqiadConfig - https://phabricator.wikimedia.org/T199220 (10dcausse) p:05Triage>03Low
[12:25:24] <AaronSchulz>	 enough of that for now
[12:25:46] * AaronSchulz wonders if something is up with makeKeyInternal()
[12:27:42] <AaronSchulz>	 Reedy: so the main problem is that 61a7e1acd0af4a5386df03335733accfde179fa1 is not yet in wmf10
[12:27:52] <AaronSchulz>	 that commit seemed like ages ago
[12:28:09] <Reedy>	 Haha
[12:28:18] <Reedy>	 13 days is nearly 2 weeks
[12:28:34] <AaronSchulz>	 the other key stuff is low priority cleanup
[12:30:48] <AaronSchulz>	 heh, the backport was abandoned...may as well restore and merge
[12:31:49] <AaronSchulz>	 though technically nobody should deploy anything now
[12:32:09] <AaronSchulz>	 zeljkof: are you doing the train later?
[12:32:25] <Reedy>	 He's AFK for 30
[12:34:20] <AaronSchulz>	 I'll just backport that one quickly
[12:35:50] <AaronSchulz>	 technically group0 wikis are bugged without it (though they are en sites with low traffic so the error rate is low)
[12:35:57] <zeljkof>	 AaronSchulz: yes
[12:38:41] <wikibugs>	 10Operations, 10Pybal, 10Traffic, 10Patch-For-Review: Tune systemd journal rate limiting for PyBal - https://phabricator.wikimedia.org/T189290 (10Vgutierrez) IMHO it would be great if those messages were in stderr instead of stdout otherwise you miss them when using journalctl + grep, that's why we missed...
[12:38:54] <AaronSchulz>	 zeljkof: so nothing is happening during "MediaWiki train - Americas version" today right?
[12:39:37] <zeljkof>	 AaronSchulz: as far as I know, correct
[12:39:52] <AaronSchulz>	 hmm, perhaps I can retry the mc part then
[12:41:23] <wikibugs>	 (03PS1) 10Jcrespo: mariadb: Depool db1086 for reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444869
[12:41:43] <Reedy>	 AaronSchulz: I guess after backporting your change and merging it... Just check the logs for generic key errors first
[12:46:38] <wikibugs>	 (03PS1) 10Jcrespo: mariadb: Allow reimage of db108X servers [puppet] - 10https://gerrit.wikimedia.org/r/444870
[12:49:05] <icinga-wm>	 RECOVERY - puppet last run on mw1347 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures
[12:52:26] <wikibugs>	 (03PS1) 10Vgutierrez: authdns: Replace baham with authdns2001 [puppet] - 10https://gerrit.wikimedia.org/r/444872 (https://phabricator.wikimedia.org/T196664)
[12:53:22] <logmsgbot>	 !log aaron@deploy1001 Synchronized php-1.32.0-wmf.10/includes/libs/objectcache/ReplicatedBagOStuff.php: 4ad6b70ba132c66e14a706eae240887885946a42 (duration: 00m 51s)
[12:53:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:59:55] <wikibugs>	 (03PS2) 10Vgutierrez: authdns: Add authdns2001 to the list of authdns hosts [puppet] - 10https://gerrit.wikimedia.org/r/444872 (https://phabricator.wikimedia.org/T196664)
[12:59:57] <wikibugs>	 (03PS1) 10Vgutierrez: authdns: Remove baham from the authdns host list [puppet] - 10https://gerrit.wikimedia.org/r/444875 (https://phabricator.wikimedia.org/T196664)
[13:00:04] <jouncebot>	 zeljkof: #bothumor I � Unicode. All rise for MediaWiki train - European version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180710T1300).
[13:01:07] <wikibugs>	 (03CR) 10Vgutierrez: [C: 04-1] "To be merged after testing the syncing between the 4 dns servers and" [puppet] - 10https://gerrit.wikimedia.org/r/444875 (https://phabricator.wikimedia.org/T196664) (owner: 10Vgutierrez)
[13:01:09] <wikibugs>	 (03PS5) 10Muehlenhoff: Allow removing Diamond gradually [puppet] - 10https://gerrit.wikimedia.org/r/429389 (https://phabricator.wikimedia.org/T183454)
[13:02:26] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 032] Allow removing Diamond gradually [puppet] - 10https://gerrit.wikimedia.org/r/429389 (https://phabricator.wikimedia.org/T183454) (owner: 10Muehlenhoff)
[13:06:48] <wikibugs>	 (03CR) 10Jcrespo: [C: 032] mariadb: Allow reimage of db108X servers [puppet] - 10https://gerrit.wikimedia.org/r/444870 (owner: 10Jcrespo)
[13:06:54] <wikibugs>	 10Operations, 10media-storage, 10User-fgiunchedi: Some swift filesystems reporting negative disk usage - https://phabricator.wikimedia.org/T199198 (10fgiunchedi) I have stopped swift/rsync on `ms-be1040` to inspect further, taking e.g. `sde` and running `xfs_repair -n` on it mentioned a discrepancy in free b...
[13:06:56] <wikibugs>	 (03PS2) 10Jcrespo: mariadb: Allow reimage of db108X servers [puppet] - 10https://gerrit.wikimedia.org/r/444870
[13:07:35] <wikibugs>	 (03PS1) 10Andrew Bogott: nova: change to api manager override [puppet] - 10https://gerrit.wikimedia.org/r/444878 (https://phabricator.wikimedia.org/T198950)
[13:09:06] <wikibugs>	 (03PS2) 10Andrew Bogott: nova: change to api manager override [puppet] - 10https://gerrit.wikimedia.org/r/444878 (https://phabricator.wikimedia.org/T198950)
[13:10:05] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 032] nova: change to api manager override [puppet] - 10https://gerrit.wikimedia.org/r/444878 (https://phabricator.wikimedia.org/T198950) (owner: 10Andrew Bogott)
[13:10:31] <wikibugs>	 (03CR) 10Thcipriani: [C: 032] Scap clean: remove remote cache directory [mediawiki-config] - 10https://gerrit.wikimedia.org/r/441920 (https://phabricator.wikimedia.org/T157030) (owner: 10Thcipriani)
[13:12:16] <wikibugs>	 (03Merged) 10jenkins-bot: Scap clean: remove remote cache directory [mediawiki-config] - 10https://gerrit.wikimedia.org/r/441920 (https://phabricator.wikimedia.org/T157030) (owner: 10Thcipriani)
[13:12:41] <wikibugs>	 (03CR) 10jenkins-bot: Scap clean: remove remote cache directory [mediawiki-config] - 10https://gerrit.wikimedia.org/r/441920 (https://phabricator.wikimedia.org/T157030) (owner: 10Thcipriani)
[13:12:43] <wikibugs>	 (03PS2) 10Daniel Kinzler: wgMultiContentRevisionSchemaMigrationStage MIGRATION_OLD [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440128 (https://phabricator.wikimedia.org/T174044) (owner: 10Addshore)
[13:14:59] <logmsgbot>	 !log thcipriani@deploy1001 Synchronized scap/plugins/clean.py: no op sync for consistancy [[gerrit:441920|Scap clean: remove remote cache directory]] (duration: 00m 51s)
[13:15:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:15:20] <wikibugs>	 (03PS1) 10Nehajha: Providing users more clue when kuberenetes is unable to delete all the objects [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/444879 (https://phabricator.wikimedia.org/T140415)
[13:17:40] <wikibugs>	 10Operations, 10media-storage, 10User-fgiunchedi: Some swift filesystems reporting negative disk usage - https://phabricator.wikimedia.org/T199198 (10fgiunchedi) Graphs of disk usage for the remaining affected hosts:  {F23554534}  {F23554533}  {F23554532}
[13:25:39] <logmsgbot>	 !log thcipriani@deploy1001 Pruned MediaWiki: 1.32.0-wmf.4 (duration: 06m 18s)
[13:25:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:26:57] <moritzm>	 !log installing cups security updates on jessie
[13:26:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:29:41] <logmsgbot>	 !log thcipriani@deploy1001 Pruned MediaWiki: 1.32.0-wmf.5 (duration: 03m 13s)
[13:29:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:30:13] <wikibugs>	 (03PS2) 10Daniel Kinzler: MCR DNM Enable MCR write-both mode on commons beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/442918 (https://phabricator.wikimedia.org/T197818)
[13:30:15] <thcipriani>	 zeljkof: ok, done cleaning for now, should be good, sorry to take up so much time
[13:30:56] <zeljkof>	 thcipriani: thanks for the help!
[13:35:40] <zeljkof>	 thcipriani, hashar: it might be obvious to some, but not to me :) I'm here https://wikitech.wikimedia.org/wiki/Heterogeneous_deployment/Train_deploys#Create_patches_to_update_wikiversions.json
[13:36:03] <zeljkof>	 and I'm not sure from where should I run `scap update-wikiversions group0 VERSION `
[13:36:29] <zeljkof>	 I guess from deploy1001 machine, but which folder?
[13:36:42] <thcipriani>	 in deploy1001:/srv/mediawiki-staging
[13:36:43] <hashar>	  /srv/mediawiki-staging
[13:37:28] <zeljkof>	 ah, thanks, just saw it in the next step :) will update the docs
[13:38:27] <zeljkof>	 thcipriani, hashar: it would be great if you could take a look at the docs a few times today, I have been updating them as I go along, to make sure I did not misunderstand something and break the docs :)
[13:38:38] <thcipriani>	 sure :)
[13:38:57] <zeljkof>	 thanks
[13:39:56] <wikibugs>	 (03CR) 10Zhuyifei1999: Providing users more clue when kuberenetes is unable to delete all the objects (031 comment) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/444879 (https://phabricator.wikimedia.org/T140415) (owner: 10Nehajha)
[13:40:34] <moritzm>	 !log rolling restart of thumbor to pick up new exiv2/openssl
[13:40:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:45:26] <wikibugs>	 (03PS2) 10Nehajha: Providing users more clue when kuberenetes is unable to delete all the objects [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/444879 (https://phabricator.wikimedia.org/T140415)
[13:45:54] <zeljkof>	 ok, hopefully this is ok https://wikitech.wikimedia.org/wiki/Heterogeneous_deployment/Train_deploys#Create_patches_to_update_wikiversions.json
[13:45:57] <zeljkof>	 doing it
[13:48:28] <moritzm>	 !log installing subversion updates from jessie 9.11 point release
[13:48:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:49:15] <wikibugs>	 (03CR) 10Aftab: [C: 031] "ok" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444749 (https://phabricator.wikimedia.org/T199161) (owner: 10Urbanecm)
[13:52:23] <wikibugs>	 (03PS1) 10Andrew Bogott: nova: allow neutron hosts to access metadata service [puppet] - 10https://gerrit.wikimedia.org/r/444886
[13:53:00] <moritzm>	 !log installing ruby-sprockets security updates
[13:53:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:53:15] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] nova: allow neutron hosts to access metadata service [puppet] - 10https://gerrit.wikimedia.org/r/444886 (owner: 10Andrew Bogott)
[13:53:27] <wikibugs>	 (03PS1) 10Zfilipin: Group0 to 1.32.0-wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444887
[13:59:58] <wikibugs>	 (03CR) 10Thcipriani: [C: 031] Group0 to 1.32.0-wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444887 (owner: 10Zfilipin)
[14:01:55] <wikibugs>	 (03PS2) 10Andrew Bogott: nova: allow neutron hosts to access metadata service [puppet] - 10https://gerrit.wikimedia.org/r/444886
[14:02:00] <hashar>	 zeljkof: I went hunting a chromium bug this afternoon sorry :D
[14:02:10] <zeljkof>	 thcipriani: ok, did everything up to `scap clean` 
[14:02:48] <zeljkof>	 since you did it, I can skip it, right? for future reference, it also runs from `you@deploy1001:/srv/mediawiki-staging`?
[14:03:00] <wikibugs>	 10Operations, 10CommRel-Internals, 10Wikimedia-Mailing-lists: Rename (create anew) the TC team mailing list - https://phabricator.wikimedia.org/T155683 (10Qgil) a:03Qgil At the end we have decided to stop using this list.
[14:03:06] <zeljkof>	 hashar: bad time for bug hunting! ;)
[14:03:15] <wikibugs>	 10Operations, 10CommRel-Internals, 10Wikimedia-Mailing-lists: Rename (create anew) the TC team mailing list - https://phabricator.wikimedia.org/T155683 (10Qgil)
[14:03:19] <zeljkof>	 I _think_ things are fine so far
[14:03:35] <zeljkof>	 nobody is screaming at me. yet
[14:03:41] <thcipriani>	 zeljkof: yeah, it runs from /srv/mediawiki-staging I did some cleanup, still more to do, but it should be fine to skip for now
[14:03:59] <zeljkof>	 thcipriani: ok, updating docs and skipping this time
[14:04:05] <wikibugs>	 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Wikidata-Query-Service-Sprint: Define an SLO for Wikidata Query Service public endpoint and communicate it - https://phabricator.wikimedia.org/T199228 (10Gehel)
[14:04:53] <raynor>	 hey, I have a couple of NO-OP config changes to deploy - they remove unused config variables - what is the protocol here?
[14:05:25] <Reedy>	 SWAT? :)
[14:05:45] <raynor>	 do I need to do that during SWAT windows?
[14:05:59] <Reedy>	 You dont have to. But that's probably the easiest
[14:06:03] <raynor>	 example: https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/444591/
[14:06:29] <Reedy>	 Do you have deploy access?
[14:06:31] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 032] nova: allow neutron hosts to access metadata service [puppet] - 10https://gerrit.wikimedia.org/r/444886 (owner: 10Andrew Bogott)
[14:06:32] <raynor>	 yes
[14:06:37] <zeljkof>	 raynor: please don't do it now, since I'm lost in the middle of my very first train :)
[14:06:53] <Reedy>	 If you've got quite a few cleanup patches to deploy, you can find a deploy window yourself
[14:07:24] <raynor>	 I can do that during swat, but because those are noop I didn't want to block the SWAT window
[14:08:07] <raynor>	 zeljkof: don't worry, now I'm just asking for info when should I do that
[14:08:40] <zeljkof>	 raynor: any swat should do, even taking the entire window should be fine
[14:08:49] <raynor>	 maybe I can do everything like after a SWAT window, lets say tomorrow mid-day
[14:08:52] <zeljkof>	 or just pick a time with no deployments
[14:09:30] <raynor>	 today is a train day, I'll do that tomorrow before mid-day SWAT
[14:09:35] <zeljkof>	 raynor: just make sure greg-g knows about it, he is probably go-to person
[14:09:53] <raynor>	 ok, thx
[14:15:08] <moritzm>	 !log installing tiff security updates on jessie
[14:15:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:17:45] <logmsgbot>	 !log zfilipin@deploy1001 Started scap: testwiki to php-1.32.0-wmf.12 and rebuild l10n cache
[14:17:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:18:41] <wikibugs>	 10Operations, 10ops-codfw, 10DC-Ops: Replace disk on wasat - https://phabricator.wikimedia.org/T197562 (10ArielGlenn) /dev/sda is just not showing up; that should be the disk that was replaced in T193394  ``` root@wasat:/var/log# ls -l /dev/disk/by-id total 0 lrwxrwxrwx 1 root root  9 Mar  6 08:39 ata-MM0500...
[14:22:09] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10Dumps-Generation, 10Patch-For-Review: snapshot1005 does not power back up - https://phabricator.wikimedia.org/T198792 (10Cmjohnson) I attempted to power off, unplug and power the server back on,  unfortunately it does not want to power on...i just get a flashing gre...
[14:23:39] <zeljkof>	 hashar, thcipriani: is this stuck or just taking a long time, it's 5 minutes so far... `14:18:06 Updating LocalisationCache for 1.32.0-wmf.10 using 30 thread(s)`
[14:24:00] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10Dumps-Generation, 10Patch-For-Review: snapshot1005 does not power back up - https://phabricator.wikimedia.org/T198792 (10ArielGlenn) The service was moved to another host already (which had been the spare) but we do still want a spare. February is not that far away...
[14:24:17] <thcipriani>	 zeljkof: you're going to be staring at that for a little while yet, that's the most time-intensive part
[14:24:55] <zeljkof>	 thcipriani: ah, time for a quick break then :) should I use `screen` for that one, since it's lost?
[14:25:01] <zeljkof>	 (for future reference)
[14:25:15] <wikibugs>	 (03CR) 10Gehel: "Looks reasonable to me. One not entirely minor point is the communication from analytics cluster, which does not go through LVS, but talks" [puppet] - 10https://gerrit.wikimedia.org/r/444610 (owner: 10EBernhardson)
[14:25:15] <thcipriani>	 I think on the new machine it's been taking like 20 minutes or so
[14:25:18] <zeljkof>	 "long", not "lost"
[14:25:36] <thcipriani>	 zeljkof: yeah, I tend to do the entirety of train in a tmux sessions
[14:25:46] <thcipriani>	 s/s$//
[14:26:07] <zeljkof>	 thcipriani: hm, maybe we should make it explicit in the docs...
[14:27:20] <thcipriani>	 sure probably a good idea: start a screen or tmux session at the start of the window, do all this there to make your life easier
[14:31:12] <marostegui>	 !log Set disk #0 offline for replacement - T199056
[14:31:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:31:19] <stashbot>	 T199056: db1069 bad disk - https://phabricator.wikimedia.org/T199056
[14:31:35] <wikibugs>	 10Operations, 10media-storage, 10User-fgiunchedi: Some swift filesystems reporting negative disk usage - https://phabricator.wikimedia.org/T199198 (10fgiunchedi) I've posted to `linux-xfs` mailing list to ask if someone has run into this bug before and how to debug further.  Regardless of whether we can succ...
[14:32:03] <wikibugs>	 10Operations, 10ops-eqiad, 10DBA: db1069 bad disk - https://phabricator.wikimedia.org/T199056 (10Marostegui) @Cmjohnson disk #0 is now offline, feel free to replace it when you can.
[14:34:22] <wikibugs>	 10Operations, 10ops-eqiad: Relabel labvirt1022.eqiad.wmnet as cloudvirt1022.eqiad.wmnet - https://phabricator.wikimedia.org/T199203 (10Cmjohnson) 05Open>03Resolved
[14:35:24] <zeljkof>	 thcipriani: any advice for us mere mortals that don't use screen/tmux much? just `screen` or `tmux` at the start, appropriate command at the end? any command line flags recommended?
[14:35:29] <wikibugs>	 10Operations, 10ops-eqiad: Relabel labvirt1021.eqiad.wmnet as cloudvirt1021.eqiad.wmnet - https://phabricator.wikimedia.org/T199132 (10Cmjohnson) 05Open>03Resolved
[14:37:04] <thcipriani>	 I use: tmux new -s 'train'; from there I do my work in the window. When I'm done I `exit` until I'm off the server. If you need to leave in the middle you can do: ctrl-b d to detach
[14:38:43] <zeljkof>	 thcipriani: sounds good to me! I'll add it to the docs :)
[14:39:32] <thcipriani>	 for screen, I guess I'd do: screen -D -RR train; and ctrl-a d to detach. I don't remember why I do -D -RR, but that's the flags I've memorized :)
[14:39:47] <wikibugs>	 10Operations, 10ops-eqiad, 10DBA: db1069 bad disk - https://phabricator.wikimedia.org/T199056 (10Marostegui) Disk replaced by Chris, let's see if this time it turns out fine! ``` root@db1069:~# megacli -PDRbld -ShowProg -PhysDrv [32:0] -a0  Rebuild Progress on Device at Enclosure 32, Slot 0 Completed 1% in 1...
[14:46:07] <zeljkof>	 thcipriani: I'll add it to the docs, I'm sure it will get fixed soon if it's not the correct way to do it :)
[14:50:45] <icinga-wm>	 RECOVERY - Device not healthy -SMART- on db1069 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db1069&var-datasource=eqiad%2520prometheus%252Fops
[14:54:36] <icinga-wm>	 PROBLEM - MegaRAID on db1069 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded)
[14:54:36] <icinga-wm>	 ACKNOWLEDGEMENT - MegaRAID on db1069 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T199232
[14:54:52] <wikibugs>	 10Operations, 10ops-eqiad: Degraded RAID on db1069 - https://phabricator.wikimedia.org/T199232 (10ops-monitoring-bot)
[14:57:27] <wikibugs>	 10Operations, 10ops-eqiad: Degraded RAID on db1069 - https://phabricator.wikimedia.org/T199232 (10Marostegui)
[14:57:29] <wikibugs>	 10Operations, 10ops-eqiad, 10DBA: db1069 bad disk - https://phabricator.wikimedia.org/T199056 (10Marostegui)
[15:00:55] <zeljkof>	 thcipriani: looks good? https://wikitech.wikimedia.org/wiki/Heterogeneous_deployment/Train_deploys#tmux_or_screen
[15:01:08] * thcipriani looks
[15:01:30] <thcipriani>	 bd
[15:01:45] <zeljkof>	 up to `sync-apaches:   0% (ok: 0; fail: 0; left: 264)`
[15:01:49] <zeljkof>	 but stuck there :/
[15:01:58] <Reedy>	 How long for?
[15:04:28] <zeljkof>	 Reedy: not stuck any more, up to 50%
[15:04:37] <Reedy>	 Did it report any errors?
[15:04:41] <zeljkof>	 thcipriani: bd?
[15:04:57] <zeljkof>	 Reedy: not so far, as far as I can see
[15:05:02] <thcipriani>	 heh, bd <- two thumbs up :)
[15:05:09] <zeljkof>	 thcipriani: :D
[15:09:15] <wikibugs>	 10Operations, 10ops-codfw: Degraded RAID on wasat - https://phabricator.wikimedia.org/T193394 (10RobH) @volans: Are you handling the reimagine?  This host is still email spamming about the defunct disk.  There is also related/duplicate task T197562,
[15:09:40] <wikibugs>	 10Operations, 10ops-codfw, 10DC-Ops: Replace disk on wasat - https://phabricator.wikimedia.org/T197562 (10RobH) 05Open>03Invalid This is indeed a dupe of T193394 which has some info.  Closing this as a dupe.
[15:10:37] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1233 is CRITICAL: CRITICAL - load average: 49.74, 25.55, 18.47
[15:10:57] <wikibugs>	 10Operations, 10ops-codfw: Degraded RAID on wasat - https://phabricator.wikimedia.org/T193394 (10Volans) @RobH No, not in my plate, I was told it was about to be reimaged, I think that @MoritzMuehlenhoff and @elukey might have more info about this.
[15:12:47] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1233 is OK: OK - load average: 24.22, 24.51, 19.07
[15:13:11] <wikibugs>	 10Operations, 10Wikimedia-Mailing-lists: Create dedicated mailing list for schema changes, API changes, and other things affecting tool maintainers - https://phabricator.wikimedia.org/T199234 (10MusikAnimal)
[15:15:56] <zeljkof>	 thcipriani, hashar: I have just noticed that train window is officially over, but I'm still at https://wikitech.wikimedia.org/wiki/Heterogeneous_deployment/Train_deploys#Sync_to_cluster_and_verify_on_testwiki
[15:16:15] <Reedy>	 Keep going :)
[15:16:33] <zeljkof>	 this is running for the last hour or so :/ `zfilipin@deploy1001:/srv/mediawiki-staging$ scap sync "testwiki to php-1.32.0-wmf.12 and rebuild l10n cache"`
[15:17:04] <zeljkof>	 the first time `14:17:45 Started scap: testwiki to php-1.32.0-wmf.12 and rebuild l10n cache`
[15:17:23] <zeljkof>	 the last, and still not over `15:08:11 Started scap-cdb-rebuild`
[15:18:02] <Reedy>	 That's gonna take a while
[15:18:42] <greg-g>	 yeah, don't stop, the next thing is puppetswat with nothing scheduled
[15:19:13] <logmsgbot>	 !log zfilipin@deploy1001 Finished scap: testwiki to php-1.32.0-wmf.12 and rebuild l10n cache (duration: 61m 27s)
[15:19:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:19:23] <zeljkof>	 oh, this step is finally done :)
[15:19:28] <Reedy>	 heh
[15:19:53] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 031] prometheus: tools: scrape paws metrics into prometheus [puppet] - 10https://gerrit.wikimedia.org/r/441514 (https://phabricator.wikimedia.org/T195030) (owner: 10Chico Venancio)
[15:27:29] <wikibugs>	 (03PS15) 10Ema: cache_text: add support for alternate_domains [puppet] - 10https://gerrit.wikimedia.org/r/443906 (https://phabricator.wikimedia.org/T164609)
[15:28:28] <wikibugs>	 (03CR) 10Ema: [C: 032] cache_text: add support for alternate_domains [puppet] - 10https://gerrit.wikimedia.org/r/443906 (https://phabricator.wikimedia.org/T164609) (owner: 10Ema)
[15:28:53] <wikibugs>	 (03PS15) 10Ema: cache_text: add misc directors and alternate_domains [puppet] - 10https://gerrit.wikimedia.org/r/443907 (https://phabricator.wikimedia.org/T164609)
[15:29:40] <wikibugs>	 (03CR) 10Ema: [C: 032] cache_text: add misc directors and alternate_domains [puppet] - 10https://gerrit.wikimedia.org/r/443907 (https://phabricator.wikimedia.org/T164609) (owner: 10Ema)
[15:33:54] <zeljkof>	 thcipriani, hashar, greg-g, Reedy: stupid question, but how do I check  l10n cache? https://wikitech.wikimedia.org/wiki/Heterogeneous_deployment/Train_deploys#Sync_to_cluster_and_verify_on_testwiki
[15:33:59] <wikibugs>	 10Operations, 10monitoring, 10User-fgiunchedi: Handle SMART for multiple shelves attached to a single smartarray controller - https://phabricator.wikimedia.org/T199236 (10fgiunchedi) p:05Triage>03Normal
[15:34:04] <Reedy>	 zeljkof: Visit the wiki
[15:34:08] <Reedy>	 Check the messages are there
[15:34:09] <zeljkof>	 ok
[15:34:14] <Reedy>	 And not <foobar> <baz> <foo>
[15:34:26] <zeljkof>	 ah, just testwiki should look ok, not broken :)
[15:34:35] <Reedy>	 Yeah
[15:34:38] <Reedy>	 Rather than like https://en.wikipedia.org/wiki/?uselang=qqx
[15:35:05] <zeljkof>	 thanks! :)
[15:35:16] <zeljkof>	 (will update the docs, for the next clueless deployer)
[15:38:12] <wikibugs>	 10Operations, 10Analytics, 10Analytics-Kanban, 10netops, 10Patch-For-Review: Review analytics-in4/6 rules on cr1/cr2 eqiad - https://phabricator.wikimedia.org/T198623 (10ayounsi)
[15:38:24] <wikibugs>	 (03PS3) 10Vgutierrez: [WIP] get rid of openssl CLI usage [software/certcentral] - 10https://gerrit.wikimedia.org/r/444631
[15:38:27] <icinga-wm>	 PROBLEM - eventstreams on scb2003 is CRITICAL: connect to address 10.192.0.33 and port 8092: Connection refused
[15:38:28] <icinga-wm>	 PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) timed out before a response was received: /{domain}/v1/page/definition/{title}{/revision}{/tid} (retrieve definitions for cat) timed out before a response was received
[15:38:38] <wikibugs>	 10Operations, 10ops-eqiad, 10Traffic, 10Patch-For-Review: rack/setup/install lvs101[3-6] - https://phabricator.wikimedia.org/T184293 (10Cmjohnson) lvs1015 idrac is setup, I think it's cabled correctly but I am not really sure, enp4s0f1 doesn't translate for me looking at h/w but I am pretty sure it matches...
[15:38:47] <icinga-wm>	 PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/references/{title}{/revision} (Get references of a test page) timed out before a response was received
[15:39:18] <wikibugs>	 10Operations, 10Analytics, 10Analytics-Kanban, 10netops, 10Patch-For-Review: Review analytics-in4/6 rules on cr1/cr2 eqiad - https://phabricator.wikimedia.org/T198623 (10ayounsi)
[15:39:27] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] [WIP] get rid of openssl CLI usage [software/certcentral] - 10https://gerrit.wikimedia.org/r/444631 (owner: 10Vgutierrez)
[15:39:28] <icinga-wm>	 RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy
[15:39:28] <icinga-wm>	 RECOVERY - eventstreams on scb2003 is OK: HTTP OK: HTTP/1.1 200 OK - 1066 bytes in 0.089 second response time
[15:39:47] <icinga-wm>	 RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy
[15:40:30] <ema>	 mobrovac: troubles with restbase?
[15:43:25] <mobrovac>	 ah no, ignore ema
[15:44:07] <Reedy>	 Is that advice for life?
[15:44:15] <ema>	 'ignore ema'
[15:44:29] <ema>	 it isn't bad advice really! :)
[15:44:57] <icinga-wm>	 RECOVERY - MegaRAID on db1069 is OK: OK: optimal, 1 logical, 2 physical, WriteBack policy
[15:46:33] <wikibugs>	 10Operations, 10ops-eqiad, 10Traffic, 10Patch-For-Review: rack/setup/install lvs101[3-6] - https://phabricator.wikimedia.org/T184293 (10Vgutierrez) @Cmjohnson take into account that eth0 should be enp4s0f0, not enp4s0f1 :)  BTW, would you mind checking the ethernet firmware version and update them if neede...
[15:48:29] <vgutierrez>	 friendly reminder: we are going to replace baham.w.o with authdns2001.w.o in a few minutes (16:00 UTC), please don't merge any operations/dns commits till we are done, thanks! :)
[15:52:07] <icinga-wm>	 PROBLEM - Varnish frontend child restarted on cp3033 is CRITICAL: 16 gt 3 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3033&var-datasource=esams+prometheus/ops
[15:52:12] <wikibugs>	 (03PS1) 10WMDE-Fisch: Enable FileExporter for sourceswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444901 (https://phabricator.wikimedia.org/T198594)
[15:53:06] <vgutierrez>	 ema ^^
[15:53:28] <ema>	 vgutierrez: yup, looking
[15:53:34] <vgutierrez>	 <3
[15:56:21] <wikibugs>	 (03PS4) 10Vgutierrez: [WIP] get rid of openssl CLI usage [software/certcentral] - 10https://gerrit.wikimedia.org/r/444631
[15:57:25] <wikibugs>	 10Operations, 10Analytics, 10Analytics-Kanban, 10netops, 10Patch-For-Review: Review analytics-in4/6 rules on cr1/cr2 eqiad - https://phabricator.wikimedia.org/T198623 (10ayounsi) I added the IPv6 equivalent of the v4 filter with a default "log+permit" term, so we can see if we missed anything.  3 highlig...
[15:57:47] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] [WIP] get rid of openssl CLI usage [software/certcentral] - 10https://gerrit.wikimedia.org/r/444631 (owner: 10Vgutierrez)
[15:58:28] <zeljkof>	 thcipriani, hashar, Reedy: um, so I put my mediawiki username/password in the file? https://wikitech.wikimedia.org/wiki/Heterogeneous_deployment/Train_deploys#Update_deploy_notes
[15:58:43] <zeljkof>	 I have 2FA, would that even work?
[15:59:00] <Reedy>	 BotPasswords!
[15:59:02] <Reedy>	 I think...
[16:00:04] <jouncebot>	 godog, moritzm, and _joe_: (Dis)respected human, time to deploy Puppet SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180710T1600). Please do the needful.
[16:00:04] <jouncebot>	 No GERRIT patches in the queue for this window AFAICS.
[16:00:07] <Reedy>	 I've not run that script in ages
[16:02:43] <wikibugs>	 (03CR) 10Vgutierrez: [C: 032] authdns: Add authdns2001 to the list of authdns hosts [puppet] - 10https://gerrit.wikimedia.org/r/444872 (https://phabricator.wikimedia.org/T196664) (owner: 10Vgutierrez)
[16:02:54] <wikibugs>	 (03PS3) 10Vgutierrez: authdns: Add authdns2001 to the list of authdns hosts [puppet] - 10https://gerrit.wikimedia.org/r/444872 (https://phabricator.wikimedia.org/T196664)
[16:03:31] <zeljkof>	 Reedy: thanks! looking for botpasswords
[16:03:40] <vgutierrez>	 !log replacing baham with authdns2001 - T196664
[16:03:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:03:43] <stashbot>	 T196664: rack/setup/install authdns2001.wikimedia.org - https://phabricator.wikimedia.org/T196664
[16:04:04] <zeljkof>	 marxarelli: around for a little help with this? :) https://wikitech.wikimedia.org/wiki/Heterogeneous_deployment/Train_deploys#Update_deploy_notes
[16:04:53] <thcipriani>	 zeljkof: so I wrote a replacement for update-deploy-notes that I've been using, but I know that others who do the train still use the php script
[16:05:05] <thcipriani>	 it's in the tools-release repo
[16:05:36] <thcipriani>	 makedeploynotes.py
[16:05:56] <zeljkof>	 I don't have a preference, but I don't think my password would work, because 2FA
[16:06:07] <thcipriani>	 this one doesn't upload for you :)
[16:06:09] <thcipriani>	 makedeploynotes.py 1.32.0-wmf.10 1.32.0-wmf.12 | tee deploy-notes-1.32.0-wmf.12
[16:06:35] <thcipriani>	 and then just copy and paste from that file to https://www.mediawiki.org/wiki/MediaWiki_1.32/wmf.12/Changelog
[16:06:36] <zeljkof>	 thcipriani: that's cool, I can copy/paste like Bruce Lee! :)
[16:08:22] <zeljkof>	 thcipriani: can I just update the docs to use that? :) seems more sane and secure
[16:08:23] <thcipriani>	 added bonus: also you don't need to update your local checkout since that one queries gitiles
[16:08:40] <wikibugs>	 (03PS6) 10Andrew Bogott: prometheus: tools: scrape paws metrics into prometheus [puppet] - 10https://gerrit.wikimedia.org/r/441514 (https://phabricator.wikimedia.org/T195030) (owner: 10Chico Venancio)
[16:08:47] <icinga-wm>	 PROBLEM - Check systemd state on cp1008 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[16:09:30] <wikibugs>	 10Operations, 10ops-codfw: Degraded RAID on wasat - https://phabricator.wikimedia.org/T193394 (10ArielGlenn) So the new disk is showing up as /dev/sdc now, presumably a reimage would straighten everything out.
[16:09:48] <thcipriani>	 zeljkof: sure, that's the one I've been using for a while. Not sure what marxarell.i and twentyafterfou.r do.
[16:10:07] <icinga-wm>	 PROBLEM - Check systemd state on cp3033 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[16:10:08] <icinga-wm>	 PROBLEM - Varnish HTTP text-frontend - port 3121 on cp3033 is CRITICAL: connect to address 10.20.0.168 and port 3121: Connection refused
[16:10:16] <zeljkof>	 thcipriani: old docs will be in history, in case they want to go back :)
[16:10:17] <icinga-wm>	 PROBLEM - Varnish HTTP text-frontend - port 3127 on cp3033 is CRITICAL: connect to address 10.20.0.168 and port 3127: Connection refused
[16:10:32] <zeljkof>	 thcipriani: ok, I'll do my best to update the docs, then ask you to review, sounds good?
[16:10:52] <thcipriani>	 sure
[16:11:08] <icinga-wm>	 PROBLEM - Varnish HTTP text-frontend - port 3126 on cp3033 is CRITICAL: connect to address 10.20.0.168 and port 3126: Connection refused
[16:11:25] <zeljkof>	 thcipriani: thanks! sorry, it's been a long day and I am totally confused with the process
[16:11:27] <icinga-wm>	 PROBLEM - Varnish HTTP text-frontend - port 3120 on cp3033 is CRITICAL: connect to address 10.20.0.168 and port 3120: Connection refused
[16:11:27] <icinga-wm>	 PROBLEM - Varnish HTTP text-frontend - port 3124 on cp3033 is CRITICAL: connect to address 10.20.0.168 and port 3124: Connection refused
[16:11:27] <icinga-wm>	 PROBLEM - Varnish HTTP text-frontend - port 3125 on cp3033 is CRITICAL: connect to address 10.20.0.168 and port 3125: Connection refused
[16:11:45] <zeljkof>	 this copy/paste your password thing really freaks me out
[16:12:25] <ema>	 cp3033 above is me ^
[16:12:28] <icinga-wm>	 PROBLEM - Varnish HTTP text-frontend - port 3123 on cp3033 is CRITICAL: connect to address 10.20.0.168 and port 3123: Connection refused
[16:12:37] <icinga-wm>	 PROBLEM - Varnish HTTP text-frontend - port 3122 on cp3033 is CRITICAL: connect to address 10.20.0.168 and port 3122: Connection refused
[16:12:47] <icinga-wm>	 PROBLEM - Varnish HTTP text-frontend - port 80 on cp3033 is CRITICAL: connect to address 10.20.0.168 and port 80: Connection refused
[16:13:03] <zeljkof>	 thcipriani: so no need to clone core, right? 
[16:13:17] <thcipriani>	 for the python notes script?
[16:13:22] <thcipriani>	 nope, can be run from anywhere
[16:13:30] <zeljkof>	 thcipriani: yes, great, ok
[16:13:33] <zeljkof>	 removing from docs
[16:13:39] <zeljkof>	 and from my home folder :P
[16:14:03] <thcipriani>	 zeljkof: and no worries about the questions. I am fully aware of the potential negative effects the train process can have on one's psyche.
[16:14:11] <thcipriani>	 :)
[16:14:14] <marxarelli>	 zeljkof: you're in good hands with thcipriani :)
[16:14:14] <zeljkof>	 :D
[16:14:23] * marxarelli makes his morning coffee
[16:14:29] <wikibugs>	 10Operations, 10Analytics, 10Analytics-Kanban, 10netops, 10Patch-For-Review: Review analytics-in4/6 rules on cr1/cr2 eqiad - https://phabricator.wikimedia.org/T198623 (10Addshore) The WMDE scripts have requests going to the following places not via the webproxy:  - https://noc.wikimedia.org/conf/dblists/...
[16:14:43] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 032] prometheus: tools: scrape paws metrics into prometheus [puppet] - 10https://gerrit.wikimedia.org/r/441514 (https://phabricator.wikimedia.org/T195030) (owner: 10Chico Venancio)
[16:15:04] <zeljkof>	 I'll polish the docs so the next person will just have to copy/paste ;)
[16:15:06] <wikibugs>	 (03PS1) 10Ema: Revert "cache_text: add support for alternate_domains" [puppet] - 10https://gerrit.wikimedia.org/r/444903
[16:15:11] <marxarelli>	 zeljkof: seriously though, i can help with the process this week as well
[16:16:18] <wikibugs>	 (03PS2) 10Ema: Revert "cache_text: add support for alternate_domains" [puppet] - 10https://gerrit.wikimedia.org/r/444903
[16:17:15] <wikibugs>	 (03CR) 10Ema: [C: 032] Revert "cache_text: add support for alternate_domains" [puppet] - 10https://gerrit.wikimedia.org/r/444903 (owner: 10Ema)
[16:20:28] <icinga-wm>	 RECOVERY - Varnish HTTP text-frontend - port 3126 on cp3033 is OK: HTTP OK: HTTP/1.1 200 OK - 502 bytes in 0.171 second response time
[16:20:47] <icinga-wm>	 RECOVERY - Varnish HTTP text-frontend - port 3120 on cp3033 is OK: HTTP OK: HTTP/1.1 200 OK - 502 bytes in 0.167 second response time
[16:20:47] <icinga-wm>	 RECOVERY - Varnish HTTP text-frontend - port 3124 on cp3033 is OK: HTTP OK: HTTP/1.1 200 OK - 502 bytes in 0.167 second response time
[16:20:47] <icinga-wm>	 RECOVERY - Varnish HTTP text-frontend - port 3125 on cp3033 is OK: HTTP OK: HTTP/1.1 200 OK - 502 bytes in 0.168 second response time
[16:20:48] <icinga-wm>	 RECOVERY - Varnish frontend child restarted on cp3033 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3033&var-datasource=esams+prometheus/ops
[16:21:08] <icinga-wm>	 RECOVERY - Check systemd state on cp3033 is OK: OK - running: The system is fully operational
[16:21:48] <icinga-wm>	 RECOVERY - Varnish HTTP text-frontend - port 3123 on cp3033 is OK: HTTP OK: HTTP/1.1 200 OK - 501 bytes in 0.168 second response time
[16:21:57] <icinga-wm>	 RECOVERY - Varnish HTTP text-frontend - port 3122 on cp3033 is OK: HTTP OK: HTTP/1.1 200 OK - 502 bytes in 0.167 second response time
[16:22:07] <icinga-wm>	 RECOVERY - Varnish HTTP text-frontend - port 80 on cp3033 is OK: HTTP OK: HTTP/1.1 200 OK - 503 bytes in 0.167 second response time
[16:22:37] <icinga-wm>	 RECOVERY - Varnish HTTP text-frontend - port 3121 on cp3033 is OK: HTTP OK: HTTP/1.1 200 OK - 501 bytes in 0.167 second response time
[16:22:38] <icinga-wm>	 RECOVERY - Varnish HTTP text-frontend - port 3127 on cp3033 is OK: HTTP OK: HTTP/1.1 200 OK - 502 bytes in 0.167 second response time
[16:22:46] <wikibugs>	 10Operations, 10Analytics, 10Analytics-Kanban, 10netops, 10Patch-For-Review: Review analytics-in4/6 rules on cr1/cr2 eqiad - https://phabricator.wikimedia.org/T198623 (10elukey) >>! In T198623#4412494, @Addshore wrote: > The WMDE scripts have requests going to the following places not via the webproxy: >...
[16:22:56] <wikibugs>	 (03CR) 10EBernhardson: [C: 04-1] "I took a look into the analytics cluster part, what we can do is utilize the HostHeaderSSLAdapter from requests_toolbelt to connect to the" [puppet] - 10https://gerrit.wikimedia.org/r/444610 (owner: 10EBernhardson)
[16:23:24] <wikibugs>	 10Operations, 10Analytics, 10Analytics-Kanban, 10netops, 10Patch-For-Review: Review analytics-in4/6 rules on cr1/cr2 eqiad - https://phabricator.wikimedia.org/T198623 (10ayounsi) And more redundant, as query.wikidata.org and wikidata.org are load balanced.
[16:25:10] <zeljkof>	 thcipriani: looks good? https://wikitech.wikimedia.org/wiki/Heterogeneous_deployment/Train_deploys#Update_deploy_notes
[16:25:18] <icinga-wm>	 RECOVERY - Check systemd state on cp1008 is OK: OK - running: The system is fully operational
[16:25:26] <zeljkof>	 thcipriani: diff https://wikitech.wikimedia.org/w/index.php?title=Heterogeneous_deployment%2FTrain_deploys&type=revision&diff=1796714&oldid=1796711
[16:26:49] <thcipriani>	 zeljkof: lgtm, although I've been running this from my laptop
[16:27:03] <zeljkof>	 thcipriani: ah
[16:27:40] <zeljkof>	 hm, for simpler docs, let's just assume everything is on the deployment server
[16:27:47] <zeljkof>	 and then people might do as they want
[16:27:52] <zeljkof>	 I might make a note
[16:28:10] <zeljkof>	 good point, no need for this to be there
[16:28:23] <wikibugs>	 (03PS1) 10Pmiazga: Enable page previews for all new editors [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444906 (https://phabricator.wikimedia.org/T197719)
[16:28:40] <wikibugs>	 (03CR) 10Vgutierrez: [C: 032] "baham depooled successfully :)" [puppet] - 10https://gerrit.wikimedia.org/r/444875 (https://phabricator.wikimedia.org/T196664) (owner: 10Vgutierrez)
[16:28:47] <zeljkof>	 thcipriani: also, `on Windows _netrc` makes no sense and can be removed, right? https://wikitech.wikimedia.org/wiki/Heterogeneous_deployment/Train_deploys#Setup
[16:28:52] <wikibugs>	 (03PS2) 10Vgutierrez: authdns: Remove baham from the authdns host list [puppet] - 10https://gerrit.wikimedia.org/r/444875 (https://phabricator.wikimedia.org/T196664)
[16:29:22] <thcipriani>	 zeljkof: sure, I think it must've been copy and paste from some other instructions somewhere
[16:29:43] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: profile::mediawiki::web_testing: create profile [puppet] - 10https://gerrit.wikimedia.org/r/444908
[16:29:45] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: apache-fast-test: read files from the tests directory as a fallback [puppet] - 10https://gerrit.wikimedia.org/r/444909
[16:29:47] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: profile::mediawiki::web_testing: add script for deploying apache changes [puppet] - 10https://gerrit.wikimedia.org/r/444910
[16:30:10] <wikibugs>	 (03PS3) 10Arturo Borrero Gonzalez: install_server: partman: refresh labvirt-ssd partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/444837
[16:33:34] <zeljkof>	 thcipriani: makedeploynotes.py complains with `ModuleNotFoundError: No module named 'requests'`
[16:33:46] <zeljkof>	 I guess I have to install dependencies?
[16:34:23] <thcipriani>	 yeah, python3-requests
[16:34:42] <thcipriani>	 although that seems to be installed on deploy1001
[16:34:53] <zeljkof>	 ah, ran it on my machine
[16:34:57] <zeljkof>	 ok, running there :)
[16:35:21] <wikibugs>	 10Operations, 10ops-eqiad, 10DBA: db1069 bad disk - https://phabricator.wikimedia.org/T199056 (10Marostegui) 05Open>03Resolved All good this time ``` root@db1069:~# megacli -LDPDInfo -aAll  Adapter #0  Number of Virtual Disks: 1 Virtual Drive: 0 (Target Id: 0) Name                : RAID Level          :...
[16:36:44] <zeljkof>	 runs fine there
[16:36:55] <thcipriani>	 nice
[16:37:56] <zeljkof>	 thcipriani: it's magic! https://www.mediawiki.org/wiki/MediaWiki_1.32/wmf.12/Changelog
[16:38:24] <vgutierrez>	 baham has been replaced successfully with authdns2001, merges to operations/dns can be done as usual
[16:38:32] <ema>	 \o/
[16:40:07] <zeljkof>	 thcipriani: uh oh, so I should have done all this _before_ the window :) ok, finally at https://wikitech.wikimedia.org/wiki/Heterogeneous_deployment/Train_deploys#Switch_group0_wikis_to_VERSION
[16:40:52] <thcipriani>	 all of it can be done outside the window. I generally do the branch cut earlier, but, in general, I do everything else inside the window.
[16:41:08] <wikibugs>	 (03CR) 10Zfilipin: [C: 032] Group0 to 1.32.0-wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444887 (owner: 10Zfilipin)
[16:41:24] <addshore>	 sounds like fun first train zeljkof :D
[16:41:34] <zeljkof>	 addshore: oh yeah! :D
[16:41:55] <addshore>	 I remember reading the docs a few weeks back and being glad I didn't have to do it ;)
[16:42:16] <Lucas_WMDE>	 any estimate how long the remaining train will take? :)
[16:42:26] * bd808 remembers writing the first public docs on a train deploy and being terrified
[16:42:29] <zeljkof>	 addshore: read the docs now, I've been working on them all day, they haven't been in better shape in a long time ;)
[16:42:30] <wikibugs>	 (03Merged) 10jenkins-bot: Group0 to 1.32.0-wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444887 (owner: 10Zfilipin)
[16:42:43] <zeljkof>	 Lucas_WMDE: no :D
[16:42:48] <wikibugs>	 (03CR) 10jenkins-bot: Group0 to 1.32.0-wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444887 (owner: 10Zfilipin)
[16:42:48] <Lucas_WMDE>	 ok :D
[16:42:51] <zeljkof>	 estimate: two hours ago
[16:42:58] <zeljkof>	 obviously wrong
[16:43:00] <Lucas_WMDE>	 well, I’ll just wish you good luck then
[16:43:07] <zeljkof>	 thanks :D will need it
[16:43:07] <Lucas_WMDE>	 and see how it goes
[16:43:16] <zeljkof>	 the first train ever
[16:43:22] <zeljkof>	 (for me, that is)
[16:44:18] <zeljkof>	 bd808: there are still a few things that raise my hair :D but thanks to thcipriani there is one less as of a few minutes ago
[16:45:47] <wikibugs>	 (03PS1) 10Rush: dumps: add labstore1006 back for dumps serving to cloud [puppet] - 10https://gerrit.wikimedia.org/r/444914 (https://phabricator.wikimedia.org/T198420)
[16:47:02] <wikibugs>	 10Operations, 10DBA, 10JADE, 10Scoring-platform-team (Current), 10User-Joe: Extension:JADE scalability concerns due to creating a page per revision - https://phabricator.wikimedia.org/T196547 (10mark)
[16:47:17] <wikibugs>	 10Operations, 10monitoring, 10User-fgiunchedi: Handle SMART for multiple shelves attached to a single smartarray controller - https://phabricator.wikimedia.org/T199236 (10Bstorm)
[16:47:32] <wikibugs>	 (03PS2) 10Rush: dumps: add labstore1006 back for dumps serving to cloud [puppet] - 10https://gerrit.wikimedia.org/r/444914 (https://phabricator.wikimedia.org/T198420)
[16:48:11] <wikibugs>	 10Operations, 10ops-eqiad, 10Cloud-VPS, 10Datasets-General-or-Unknown: rack upgraded storage capacity in labstore100[67].eqiad.wmnet - https://phabricator.wikimedia.org/T196651 (10Bstorm)
[16:48:23] <wikibugs>	 (03CR) 10Rush: [C: 032] dumps: add labstore1006 back for dumps serving to cloud [puppet] - 10https://gerrit.wikimedia.org/r/444914 (https://phabricator.wikimedia.org/T198420) (owner: 10Rush)
[16:51:47] <logmsgbot>	 !log zfilipin@deploy1001 rebuilt and synchronized wikiversions files: group0 to 1.32.0-wmf.12
[16:51:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:52:45] <wikibugs>	 10Operations, 10Research, 10SRE-Access-Requests: Request access to data for citation usage research - https://phabricator.wikimedia.org/T198662 (10Miriam) @DarTar could sign off  here to give  @Pirroh access to the data? (See task description) Thanks!
[17:00:04] <jouncebot>	 cscott, arlolra, subbu, halfak, and Amir1: Dear deployers, time to do the Services – Graphoid / Parsoid / Citoid / ORES deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180710T1700).
[17:06:02] <zeljkof>	 thcipriani: I'm... I'm done...? https://www.mediawiki.org/w/index.php?diff=2825343&oldid=2819773&title=MediaWiki_1.32/Roadmap&type=revision&diffmode=source
[17:06:48] <thcipriani>	 looks like it: https://www.mediawiki.org/wiki/Special:Version
[17:06:55] <thcipriani>	 kudos on the first train :)
[17:07:45] <zeljkof>	 thcipriani: party time! :) thanks for all the help! I think it's time for an adult beverage
[17:08:39] <wikibugs>	 10Operations, 10Traffic, 10Wikimedia-General-or-Unknown: Expire cache after Wikipedia copyright protests - https://phabricator.wikimedia.org/T199252 (10Nemo_bis)
[17:08:52] <wikibugs>	 10Operations, 10Traffic, 10Wikimedia-General-or-Unknown: Expire cache after Wikipedia/Wikimedia copyright protests - https://phabricator.wikimedia.org/T199252 (10Nemo_bis)
[17:11:27] <wikibugs>	 10Operations, 10Traffic, 10Wikimedia-General-or-Unknown: Expire cache after Wikipedia/Wikimedia copyright protests - https://phabricator.wikimedia.org/T199252 (10BBlack) wgCacheEpoch is probably about the parser cache, which is separate from #Traffic 's Varnish caching.  Either one could be an issue here, or...
[17:19:28] <wikibugs>	 10Operations, 10Traffic, 10Wikimedia-General-or-Unknown: Expire cache after Wikipedia/Wikimedia copyright protests - https://phabricator.wikimedia.org/T199252 (10matmarex) The redirects (that I know of) were implemented using JavaScript code in MediaWiki:Common.js etc, for example: * https://it.wikipedia.org...
[17:20:02] <wikibugs>	 (03PS1) 10Reedy: Make the 'affcomusergroup' require users to be logged in to send it [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444918
[17:28:15] <wikibugs>	 10Operations, 10Discovery-Search (Current work): migrate elasticsearch to stretch (from jessie) - https://phabricator.wikimedia.org/T193649 (10Gehel) testing in progress on deployment-prep
[17:40:00] <wikibugs>	 (03PS2) 10Reedy: Make the 'affcomusergroup' require users to be logged in to send it [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444918
[17:54:55] <wikibugs>	 (03PS1) 10Ottomata: Set contact_group to admins for main MirrorMaker alerts [puppet] - 10https://gerrit.wikimedia.org/r/444925
[17:58:37] <wikibugs>	 10Operations, 10Traffic, 10Wikimedia-General-or-Unknown: Expire cache after Wikipedia/Wikimedia copyright protests - https://phabricator.wikimedia.org/T199252 (10Daimona) Several users have reported this problem, however they weren't really redirected: instead, while searching stuff on google, //google redir...
[18:03:45] <wikibugs>	 10Operations: Decommission servermon - https://phabricator.wikimedia.org/T198939 (10faidon) I'm using servermon for fact query regularly, but I think I'm one of the very few :) I admit I haven't played around much with puppetboard to adjust my use cases, so that may be something that could potentially work (with...
[18:12:21] <wikibugs>	 (03CR) 10Anomie: "This either needs to wait for Thursday after the train (SCHEMA_COMPAT_OLD is added in wmf.12), or it needs to use MIGRATION_OLD like the c" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440128 (https://phabricator.wikimedia.org/T174044) (owner: 10Addshore)
[18:13:46] <wikibugs>	 (03CR) 10Alex Monk: [WIP] get rid of openssl CLI usage (031 comment) [software/certcentral] - 10https://gerrit.wikimedia.org/r/444631 (owner: 10Vgutierrez)
[18:17:54] <wikibugs>	 (03PS3) 10RobH: DNS: Add DNS asset tag mgmt for spare servers [dns] - 10https://gerrit.wikimedia.org/r/441063 (https://phabricator.wikimedia.org/T196666) (owner: 10Papaul)
[18:18:12] <wikibugs>	 10Operations, 10Performance-Team, 10Traffic, 10Wikimedia-General-or-Unknown: Expire cache after Wikipedia/Wikimedia copyright protests - https://phabricator.wikimedia.org/T199252 (10Daimona) I think we already purged common.js several times after the blackout, anyway let's see if it works. As for Google, I...
[18:18:22] <wikibugs>	 (03PS1) 10Aaron Schulz: Make all non-test wikis write to both nutcracker and mcrouter again [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444932
[18:19:45] <wikibugs>	 (03CR) 10RobH: [C: 032] DNS: Add DNS asset tag mgmt for spare servers [dns] - 10https://gerrit.wikimedia.org/r/441063 (https://phabricator.wikimedia.org/T196666) (owner: 10Papaul)
[18:22:04] <wikibugs>	 (03PS2) 10Aaron Schulz: Make all non-test wikis write to both nutcracker and mcrouter again [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444932 (https://phabricator.wikimedia.org/T198239)
[18:22:59] <wikibugs>	 (03CR) 10Daniel Kinzler: [C: 04-1] "Oh, you are right! I would have expected Jenkins to notice this... it should be running the deployed branch, not master, on this repo!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440128 (https://phabricator.wikimedia.org/T174044) (owner: 10Addshore)
[18:23:07] <wikibugs>	 (03CR) 10Jforrester: [C: 04-1] "Requires wmf.13 to be live on meta." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444918 (owner: 10Reedy)
[18:24:07] <AaronSchulz>	 Reedy: time for another go methinks
[18:24:15] <wikibugs>	 (03CR) 10Reedy: "It's basically a noop without the code that uses it though" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444918 (owner: 10Reedy)
[18:29:04] <wikibugs>	 (03CR) 10Aaron Schulz: [C: 032] Make all non-test wikis write to both nutcracker and mcrouter again [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444932 (https://phabricator.wikimedia.org/T198239) (owner: 10Aaron Schulz)
[18:30:40] <wikibugs>	 (03Merged) 10jenkins-bot: Make all non-test wikis write to both nutcracker and mcrouter again [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444932 (https://phabricator.wikimedia.org/T198239) (owner: 10Aaron Schulz)
[18:33:49] <wikibugs>	 (03CR) 10Jforrester: [C: 04-1] "Sure, but we don't want the train to randomly derail because it turns out this code breaks prod in some odd way. :-)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444918 (owner: 10Reedy)
[18:33:54] <logmsgbot>	 !log aaron@deploy1001 Synchronized wmf-config/mc.php: Make all non-test wikis write to both nutcracker and mcrouter again (duration: 00m 57s)
[18:33:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:34:22] <wikibugs>	 10Puppet, 10Beta-Cluster-Infrastructure, 10Patch-For-Review: Set up puppet exported resources to collect ssh host keys for beta - https://phabricator.wikimedia.org/T72792 (10Krenair) cleaned that up today per <apergos> Krenair: I finally got a chance to note the config info for the old dumps puppetmaster in...
[18:34:23] <wikibugs>	 (03PS5) 10Alex Monk: Allow PuppetDB use on standalone puppetmasters [puppet] - 10https://gerrit.wikimedia.org/r/435631 (https://phabricator.wikimedia.org/T72792)
[18:35:55] <wikibugs>	 (03CR) 10jenkins-bot: Make all non-test wikis write to both nutcracker and mcrouter again [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444932 (https://phabricator.wikimedia.org/T198239) (owner: 10Aaron Schulz)
[18:38:03] <wikibugs>	 (03CR) 10Krinkle: "Indeed, mediawiki/includes/Setup.php:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444665 (owner: 10Jforrester)
[18:42:33] <wikibugs>	 10Operations, 10ops-eqsin, 10Traffic: cp5006 unresponsive - https://phabricator.wikimedia.org/T187157 (10RobH) After discussion with @Cmjohnson its been decided we'll go ahead and attempt to get the mainboard replaced before doing the smarthands work i suggested above.  @papaul was onsite and did the steps:...
[18:50:02] <wikibugs>	 (03PS2) 10Jforrester: Cleanup: Stop trying to set wgLocalTZOffset, it's wgLocalTZ*o*ffset [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444665
[18:51:37] <wikibugs>	 10Operations, 10ops-eqsin, 10Traffic: cp5006 unresponsive - https://phabricator.wikimedia.org/T187157 (10RobH) {F23557620}  Self dispatch  SR971650695 scheduled, including a request for an onsite technician.  Once they send me the shipping info, I'll open an inbound shipment ticket with eqsin.  I'll then sch...
[18:51:51] <wikibugs>	 10Operations, 10Performance-Team, 10Traffic, 10Wikimedia-General-or-Unknown: Expire cache after Wikipedia/Wikimedia copyright protests - https://phabricator.wikimedia.org/T199252 (10Krinkle) >>! In T199252#4413493, @Daimona wrote: > As for Google, I don't have a link but I can ask for it if you want. What...
[18:53:33] <wikibugs>	 (03CR) 10Krinkle: [C: 031] Cleanup: Stop trying to set wgLocalTZOffset, it's wgLocalTZ*o*ffset [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444665 (owner: 10Jforrester)
[18:56:32] <wikibugs>	 (03PS1) 10Krinkle: Clean up wgLocaltimezone (1/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444940
[18:56:34] <wikibugs>	 (03PS1) 10Krinkle: Clean up wgLocaltimezone (2/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444941
[19:00:04] <jouncebot>	 Deploy window MediaWiki train - Americas version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180710T1900)
[19:00:21] <wikibugs>	 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/add to spares tracking 2 single cpu misc class systems - https://phabricator.wikimedia.org/T196666 (10RobH)
[19:00:47] <wikibugs>	 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/add to spares tracking 2 single cpu misc class systems - https://phabricator.wikimedia.org/T196666 (10RobH) 05Open>03Resolved >>! In T196666#4300081, @Papaul wrote: > Switch port information : > both servers are racked in D8  > wmf6652  ge-8/0/3 >...
[19:07:15] <wikibugs>	 (03CR) 10Jforrester: [C: 031] Clean up wgLocaltimezone (1/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444940 (owner: 10Krinkle)
[19:07:24] <wikibugs>	 (03CR) 10Jforrester: [C: 031] Clean up wgLocaltimezone (2/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444941 (owner: 10Krinkle)
[19:16:28] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1227 is CRITICAL: CRITICAL - load average: 50.49, 31.62, 23.18
[19:18:37] <wikibugs>	 (03PS1) 10Krinkle: Remove unused $tmarray variables [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444946 (https://phabricator.wikimedia.org/T189966)
[19:26:01] <wikibugs>	 (03CR) 1020after4: [C: 032] Scap: UpdateInterwikiCache fix subclassing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/441571 (https://phabricator.wikimedia.org/T196642) (owner: 10Thcipriani)
[19:27:42] <wikibugs>	 (03Merged) 10jenkins-bot: Scap: UpdateInterwikiCache fix subclassing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/441571 (https://phabricator.wikimedia.org/T196642) (owner: 10Thcipriani)
[19:28:03] <wikibugs>	 (03CR) 10jenkins-bot: Scap: UpdateInterwikiCache fix subclassing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/441571 (https://phabricator.wikimedia.org/T196642) (owner: 10Thcipriani)
[19:34:06] <wikibugs>	 (03CR) 1020after4: [C: 031] "Can I get someone from SRE to merge this? It's affecting some users' ability to get work done." [puppet] - 10https://gerrit.wikimedia.org/r/444810 (https://phabricator.wikimedia.org/T198974) (owner: 1020after4)
[19:34:17] <wikibugs>	 (03PS2) 1020after4: Phabricator: Double the rate limit and connection limit [puppet] - 10https://gerrit.wikimedia.org/r/444810 (https://phabricator.wikimedia.org/T198974)
[19:34:45] <wikibugs>	 (03PS1) 10Andrew Bogott: Neutron: define dns_domain in config [puppet] - 10https://gerrit.wikimedia.org/r/444947
[19:35:23] <wikibugs>	 (03CR) 10Paladox: [C: 031] Phabricator: Double the rate limit and connection limit [puppet] - 10https://gerrit.wikimedia.org/r/444810 (https://phabricator.wikimedia.org/T198974) (owner: 1020after4)
[19:35:33] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Neutron: define dns_domain in config [puppet] - 10https://gerrit.wikimedia.org/r/444947 (owner: 10Andrew Bogott)
[19:37:08] <wikibugs>	 (03CR) 1020after4: [C: 031] "https://puppet-compiler.wmflabs.org/compiler03/11762/" [puppet] - 10https://gerrit.wikimedia.org/r/444810 (https://phabricator.wikimedia.org/T198974) (owner: 1020after4)
[19:38:48] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1233 is CRITICAL: CRITICAL - load average: 52.28, 35.71, 30.45
[19:39:47] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1227 is CRITICAL: CRITICAL - load average: 53.50, 37.23, 30.35
[19:40:48] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1226 is CRITICAL: CRITICAL - load average: 58.53, 38.14, 27.75
[19:40:54] <wikibugs>	 (03PS2) 10Andrew Bogott: Neutron: define dns_domain in config [puppet] - 10https://gerrit.wikimedia.org/r/444947
[19:42:13] <wikibugs>	 10Operations, 10Performance-Team, 10Traffic, 10Wikimedia-General-or-Unknown: Expire cache after Wikipedia/Wikimedia copyright protests - https://phabricator.wikimedia.org/T199252 (10BBlack) Was the "temporary" JS redirect a 301 perhaps?
[19:43:40] <wikibugs>	 (03CR) 10Framawiki: [C: 031] Create Publisher namespace in Bengali Wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444664 (https://phabricator.wikimedia.org/T199028) (owner: 10Zoranzoki21)
[19:44:55] <wikibugs>	 (03PS3) 10Andrew Bogott: Neutron: define dns_domain in config [puppet] - 10https://gerrit.wikimedia.org/r/444947
[19:46:41] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 032] Neutron: define dns_domain in config [puppet] - 10https://gerrit.wikimedia.org/r/444947 (owner: 10Andrew Bogott)
[19:48:30] <wikibugs>	 10Operations, 10Performance-Team, 10Traffic, 10Wikimedia-General-or-Unknown: Expire cache after Wikipedia/Wikimedia copyright protests - https://phabricator.wikimedia.org/T199252 (10Krinkle) >>! In T199252#4413739, @BBlack wrote: > Was the "temporary" JS redirect a 301 perhaps?  Nope, it wasn't any form of...
[19:50:38] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1226 is CRITICAL: CRITICAL - load average: 46.74, 37.57, 32.67
[19:51:18] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1234 is CRITICAL: CRITICAL - load average: 49.94, 30.21, 22.71
[19:52:27] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1231 is CRITICAL: CRITICAL - load average: 48.63, 33.39, 25.94
[19:52:57] <icinga-wm>	 PROBLEM - HHVM rendering on mw1230 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:53:08] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1222 is CRITICAL: CRITICAL - load average: 48.58, 30.07, 21.50
[19:53:27] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1225 is CRITICAL: CRITICAL - load average: 54.44, 30.74, 20.24
[19:53:48] <icinga-wm>	 RECOVERY - HHVM rendering on mw1230 is OK: HTTP OK: HTTP/1.1 200 OK - 73926 bytes in 0.468 second response time
[19:54:58] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1229 is CRITICAL: CRITICAL - load average: 44.02, 36.55, 28.53
[19:55:08] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1230 is CRITICAL: CRITICAL - load average: 54.10, 35.86, 23.95
[19:55:18] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1222 is CRITICAL: CRITICAL - load average: 53.23, 36.52, 24.95
[19:55:27] <icinga-wm>	 PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[19:55:38] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1225 is CRITICAL: CRITICAL - load average: 49.35, 34.98, 23.04
[19:55:47] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1234 is CRITICAL: CRITICAL - load average: 54.72, 39.35, 28.24
[19:56:07] <icinga-wm>	 PROBLEM - High CPU load on API appserver on mw1232 is CRITICAL: CRITICAL - load average: 56.61, 40.33, 29.37
[19:56:38] <icinga-wm>	 PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5
[19:56:47] <icinga-wm>	 PROBLEM - HTTP availability for Varnish at eqiad on einsteinium is CRITICAL: job=varnish-text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[19:56:48] <icinga-wm>	 PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on einsteinium is CRITICAL: cluster=cache_text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[19:56:57] <icinga-wm>	 PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on einsteinium is CRITICAL: cluster=cache_text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[19:56:57] <icinga-wm>	 PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on einsteinium is CRITICAL: cluster=cache_text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[19:57:08] <icinga-wm>	 PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on einsteinium is CRITICAL: cluster=cache_text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[19:57:27] <icinga-wm>	 PROBLEM - HTTP availability for Varnish at ulsfo on einsteinium is CRITICAL: job=varnish-text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[19:57:47] <icinga-wm>	 PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5
[19:57:48] <icinga-wm>	 PROBLEM - Codfw HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=codfw&var-cache_type=All&var-status_type=5
[19:57:58] <icinga-wm>	 PROBLEM - HTTP availability for Varnish at codfw on einsteinium is CRITICAL: job=varnish-text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[19:58:17] <icinga-wm>	 PROBLEM - HTTP availability for Varnish at esams on einsteinium is CRITICAL: job=varnish-text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[19:58:28] <icinga-wm>	 PROBLEM - HTTP availability for Varnish at eqsin on einsteinium is CRITICAL: job=varnish-text site=eqsin https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[19:58:28] <icinga-wm>	 PROBLEM - HTTP availability for Nginx -SSL terminators- at eqsin on einsteinium is CRITICAL: cluster=cache_text site=eqsin https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[19:59:17] <icinga-wm>	 PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5
[20:00:07] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1225 is OK: OK - load average: 14.11, 24.69, 21.98
[20:01:17] <icinga-wm>	 PROBLEM - Eqsin HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=eqsin&var-cache_type=All&var-status_type=5
[20:01:32] <Krenair>	 well that's rather loud
[20:01:40] <Krenair>	 I wonder if it's important
[20:01:48] <icinga-wm>	 RECOVERY - HTTP availability for Varnish at eqsin on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[20:01:48] <icinga-wm>	 RECOVERY - HTTP availability for Varnish at ulsfo on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[20:01:48] <icinga-wm>	 RECOVERY - HTTP availability for Nginx -SSL terminators- at eqsin on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[20:01:48] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1230 is OK: OK - load average: 15.28, 24.78, 23.66
[20:02:05] <Krenair>	 Krinkle, any idea what's up?
[20:02:20] <Krinkle>	 I do not know.
[20:02:27] <icinga-wm>	 RECOVERY - HTTP availability for Varnish at codfw on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[20:02:29] <Krinkle>	 Checking logstash
[20:02:47] <icinga-wm>	 RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[20:02:47] <icinga-wm>	 RECOVERY - HTTP availability for Varnish at esams on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[20:03:19] <Krenair>	 https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 looks like there was a brief problem
[20:03:27] <icinga-wm>	 RECOVERY - HTTP availability for Varnish at eqiad on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1
[20:03:27] <icinga-wm>	 RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[20:03:37] <icinga-wm>	 RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[20:03:37] <icinga-wm>	 RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[20:03:40] <Krinkle>	 Indeed.
[20:03:45] <Krinkle>	 Edit count also plummeted momentarily
[20:03:46] <Krinkle>	 https://grafana.wikimedia.org/dashboard/db/edit-count?orgId=1&from=1531239268571&to=1531253015803
[20:03:56] <wikibugs>	 10Operations, 10Cloud-Services, 10Patch-For-Review, 10cloud-services-team (Kanban): rebuild tools-grid-master as a large instance - https://phabricator.wikimedia.org/T162955 (10Bstorm)
[20:04:17] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1222 is OK: OK - load average: 14.58, 22.44, 23.45
[20:04:18] <Krinkle>	 save timing went up significantly (4X) - https://grafana.wikimedia.org/dashboard/db/save-timing?orgId=1&from=1531247940501&to=1531253040967
[20:05:01] <Krinkle>	 Error logs show a significant ERROR increase
[20:05:02] <Krinkle>	 https://grafana.wikimedia.org/dashboard/db/production-logging?orgId=1&from=1531227634440&to=1531252968550
[20:05:15] <Krinkle>	 Sustained for about 2 hours
[20:08:57] <icinga-wm>	 RECOVERY - Codfw HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=codfw&var-cache_type=All&var-status_type=5
[20:09:58] <icinga-wm>	 RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5
[20:10:04] <Krinkle>	 It seems in Logstash, the majority of errors are:
[20:10:06] <Krinkle>	 > Memcached error for key "{memcached-key}" on server "{memcached-server}": SERVER ERROR
[20:10:08] <icinga-wm>	 RECOVERY - Eqsin HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=eqsin&var-cache_type=All&var-status_type=5
[20:10:18] <icinga-wm>	 RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5
[20:10:18] <Krinkle>	 4,700 hits in 15min
[20:10:37] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1232 is OK: OK - load average: 13.81, 19.00, 23.96
[20:10:58] <icinga-wm>	 RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[20:11:07] <icinga-wm>	 RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5
[20:11:17] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1234 is OK: OK - load average: 11.87, 17.54, 23.91
[20:11:29] <Krinkle>	 also lots of
[20:11:30] <Krinkle>	 > Memcached error for key "{memcached-key}" on server "{memcached-server}": SERVER HAS FAILED AND IS DISABLED UNTIL TIMED RETRY
[20:11:53] <Krinkle>	 AaronSchulz: _joe_ : moritzm 
[20:12:38] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1229 is OK: OK - load average: 13.96, 17.73, 23.99
[20:13:57] <AaronSchulz>	 Krinkle: some kind of timeout issue? Odd.
[20:14:04] <Krinkle>	 The top key in logstash/memcached is 'wikibase_shared/1_32_0-wmf_10-wikidatawiki-hhvm:CacheAwarePropertyInfoStore'
[20:14:17] <Krinkle>	 Which follows the shape of the overall error burst exactly
[20:15:38] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1231 is OK: OK - load average: 10.82, 14.76, 23.67
[20:18:50] <Krinkle>	 AaronSchulz: So the mcrouter write is now live for all wikis, right?
[20:19:27] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1226 is OK: OK - load average: 13.82, 15.58, 23.78
[20:20:03] <Krinkle>	 It's still affecting snapshot100x hosts a lot. It used to have ~0 errors in channel:memcached, now 1000 per 5min and counting.
[20:20:30] <Krinkle>	 AaronSchulz: https://logstash.wikimedia.org/goto/a610950920965b1bb57e0c50a7130cc3
[20:20:50] <AaronSchulz>	 yeah, the snapshot1001 thing is weird
[20:21:18] <Krinkle>	 Looking at all other servers only, it seems to have recovered.
[20:21:34] <Krinkle>	 > https://logstash.wikimedia.org/goto/85796fdc0d11e4bc3636650f7e18928e
[20:24:24] <wikibugs>	 (03CR) 10Jdlrobson: [C: 031] Enable page previews for all new editors (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444906 (https://phabricator.wikimedia.org/T197719) (owner: 10Pmiazga)
[20:28:08] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1227 is OK: OK - load average: 12.35, 14.20, 23.88
[20:30:25] <AaronSchulz>	 Krinkle: I see that https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/442813/ is still not in wmf10
[20:31:07] <AaronSchulz>	 I wonder if some lock()/add() thing caused some important value to never be updated and expire or something
[20:32:54] <Krinkle>	 indeed
[20:32:54] <AaronSchulz>	 Krinkle: https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/445005/
[20:33:14] <Krinkle>	 lock/add() just fails deterministically every time for some keys.
[20:34:26] <AaronSchulz>	 the new version is in wmf12 at least (which is running in group0)
[20:35:42] <Krinkle>	 AaronSchulz: Yeah
[20:35:53] <Krinkle>	 AaronSchulz: Are there any other objectcache related patches not in wmf12 yet?
[20:35:55] <Krinkle>	 wmf10 *
[20:35:59] <Krinkle>	 Or was this the only one?
[20:36:16] <Krinkle>	 Just thinking whether we should do more at the same time and/or rollback until wmf12 is everywhere
[20:36:20] <AaronSchulz>	 just those two (the first one I already backported)
[20:36:24] <Krinkle>	 OK
[20:36:26] <Krinkle>	 let's do it
[20:36:29] <AaronSchulz>	 (that was the encoding one)
[20:37:17] <icinga-wm>	 RECOVERY - High CPU load on API appserver on mw1233 is OK: OK - load average: 17.35, 18.15, 23.82
[20:42:33] <wikibugs>	 (03PS3) 10Mforns: Add HDFS whitelist path to EventLoggingSanitization job [puppet] - 10https://gerrit.wikimedia.org/r/442121 (https://phabricator.wikimedia.org/T193176)
[20:43:01] <wikibugs>	 (03PS4) 10Mforns: Add HDFS whitelist path to EventLoggingSanitization job [puppet] - 10https://gerrit.wikimedia.org/r/442121 (https://phabricator.wikimedia.org/T193176)
[20:45:03] * AaronSchulz wonders if there are any projects to make jenkins faster
[20:45:30] <wikibugs>	 (03CR) 10Mforns: "This should be good to go. Also, the backfilling went up to 7th of July, so when we merge this, it will catch up automatically since then." [puppet] - 10https://gerrit.wikimedia.org/r/442121 (https://phabricator.wikimedia.org/T193176) (owner: 10Mforns)
[20:48:42] <logmsgbot>	 !log aaron@deploy1001 Synchronized php-1.32.0-wmf.10/includes/libs/objectcache/MultiWriteBagOStuff.php: 4fba9f6a032 (duration: 00m 57s)
[20:48:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:50:40] <logmsgbot>	 !log aaron@deploy1001 Synchronized php-1.32.0-wmf.10/tests/phpunit/includes/libs/objectcache/MultiWriteBagOStuffTest.php: 4fba9f6a032 (duration: 00m 56s)
[20:50:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:03:48] <wikibugs>	 10Operations, 10Cloud-VPS, 10procurement: rack/setup/install cloudvirt102[34] - https://phabricator.wikimedia.org/T199125 (10chasemp) a:05chasemp>03Andrew
[21:04:33] <XioNoX>	 !log re-configure GTT circuit in eqiad/knams
[21:04:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:06:36] <Krinkle>	 (continuing in -perf)
[21:07:07] <wikibugs>	 (03PS1) 10Rush: openstack: eqiad1-r metadata agent for net role [puppet] - 10https://gerrit.wikimedia.org/r/445020 (https://phabricator.wikimedia.org/T196633)
[21:08:28] <wikibugs>	 (03CR) 10Rush: [C: 032] openstack: eqiad1-r metadata agent for net role [puppet] - 10https://gerrit.wikimedia.org/r/445020 (https://phabricator.wikimedia.org/T196633) (owner: 10Rush)
[21:12:51] <wikibugs>	 (03PS1) 10Andrew Bogott: vmbuilder and bootstrapvz: get hostname from metadata [puppet] - 10https://gerrit.wikimedia.org/r/445021
[21:13:44] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 032] vmbuilder and bootstrapvz: get hostname from metadata [puppet] - 10https://gerrit.wikimedia.org/r/445021 (owner: 10Andrew Bogott)
[21:16:04] <wikibugs>	 (03PS1) 10Rush: openstack: eqiad1 deployment net role notes [puppet] - 10https://gerrit.wikimedia.org/r/445022
[21:18:54] <wikibugs>	 (03PS1) 10Rush: openstack: add labnet100[34] VLAN 1120 reservations [dns] - 10https://gerrit.wikimedia.org/r/445023 (https://phabricator.wikimedia.org/T196633)
[21:24:37] <icinga-wm>	 PROBLEM - nutcracker process on scb2001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[21:24:37] <icinga-wm>	 PROBLEM - citoid endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[21:24:38] <icinga-wm>	 PROBLEM - apertium apy on scb2001 is CRITICAL: HTTP CRITICAL - No data received from host
[21:24:58] <icinga-wm>	 PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/feed/announcements (Retrieve announcements) timed out before a response was received: /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) timed out before a response was received: /_info (retrieve service info) timed out before a response was received: /{domain}/v1/page/news (get In the New
[21:24:58] <icinga-wm>	 ported language (with aggregated=true)) timed out before a response was received
[21:25:28] <icinga-wm>	 RECOVERY - nutcracker process on scb2001 is OK: PROCS OK: 1 process with UID = 111 (nutcracker), command name nutcracker
[21:25:37] <icinga-wm>	 RECOVERY - citoid endpoints health on scb2001 is OK: All endpoints are healthy
[21:25:47] <icinga-wm>	 RECOVERY - apertium apy on scb2001 is OK: HTTP OK: HTTP/1.1 200 OK - 5996 bytes in 0.076 second response time
[21:25:58] <icinga-wm>	 RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy
[21:26:28] <wikibugs>	 (03PS2) 10Rush: openstack: eqiad1 deployment net role notes [puppet] - 10https://gerrit.wikimedia.org/r/445022
[21:27:42] <wikibugs>	 (03CR) 10Rush: [C: 032] openstack: eqiad1 deployment net role notes [puppet] - 10https://gerrit.wikimedia.org/r/445022 (owner: 10Rush)
[21:31:13] <greg-g>	 AaronSchulz: Krinkle ya'll OK?
[21:32:17] <Krinkle>	 greg-g: Yeah, seems the mcrouter deployment for all wikis went ahead of a critical fix for the memc_add() function, which we knew about and fixed last week, but forgot to backport given the train is offset by one week from the original schedule.
[21:32:30] <Krinkle>	 Caused a cascading failure we'll write up later, but for now, things are fine.
[21:32:37] <Krinkle>	 I've shared a preliminary write up with you.
[21:33:41] <greg-g>	 Krinkle: ack, thanks.
[21:40:03] <wikibugs>	 (03PS1) 10Reedy: Stop logging email changes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445030
[21:40:09] <chasemp>	 !log reboot labnet100[34]
[21:40:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:56:34] <wikibugs>	 (03PS2) 10Catrope: Rollout Watchlist Structured Filters to all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440642 (https://phabricator.wikimedia.org/T181193) (owner: 10Mooeypoo)
[22:13:22] <wikibugs>	 (03PS1) 10Thcipriani: Scap: Bump version to 3.8.4 [puppet] - 10https://gerrit.wikimedia.org/r/445031 (https://phabricator.wikimedia.org/T199283)
[22:14:48] <wikibugs>	 10Operations, 10Scap, 10Patch-For-Review, 10Release-Engineering-Team (Kanban), 10Services (blocked): Update Debian Package for Scap3 to 3.8.4-1 - https://phabricator.wikimedia.org/T199283 (10thcipriani) Adding #Operations for the package update + puppet patch.
[22:21:47] <icinga-wm>	 PROBLEM - Check systemd state on ms-be1024 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[22:22:01] <Reedy>	 jouncebot: next
[22:22:02] <jouncebot>	 In 0 hour(s) and 37 minute(s): Evening SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180710T2300)
[22:48:17] <icinga-wm>	 PROBLEM - Check systemd state on ms-be1039 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[22:56:00] <wikibugs>	 (03PS1) 10Jdlrobson: Scrub ambox images [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445036
[23:00:04] <jouncebot>	 addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor I � Unicode. All rise for Evening SWAT (Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180710T2300).
[23:00:04] <jouncebot>	 No GERRIT patches in the queue for this window AFAICS.
[23:07:31] <wikibugs>	 (03CR) 10Pmiazga: "Jdlrobson - that's correct, but as Olga said, we want that as a default behavior no matter what. Popups are visible to anons by default, i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444906 (https://phabricator.wikimedia.org/T197719) (owner: 10Pmiazga)
[23:40:47] <wikibugs>	 (03CR) 10Jdlrobson: [C: 031] "I read Olga's comment to mean all wikis that have the feature enabled already. Note we haven't spoken to any of the projects that don't ha" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444906 (https://phabricator.wikimedia.org/T197719) (owner: 10Pmiazga)