[00:13:13] (03PS1) 10Bstorm: openstreetmap: add debian stretch to puppet role [puppet] - 10https://gerrit.wikimedia.org/r/444771 (https://phabricator.wikimedia.org/T197246) [00:20:04] (03CR) 10Bstorm: [C: 032] openstreetmap: add debian stretch to puppet role [puppet] - 10https://gerrit.wikimedia.org/r/444771 (https://phabricator.wikimedia.org/T197246) (owner: 10Bstorm) [00:21:01] !log deploying https://phabricator.wikimedia.org/rPHABc6f75c918afa1cc59472c5fe226539e093f6c3ef [00:21:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:38:04] (03PS1) 10Reedy: Remove wgTidyConfig; same as DefaultSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444775 [00:42:44] (03PS1) 10Reedy: Timeless is enabled everywhere, remove config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444776 [00:56:12] hi. i've been banned from phabricator? :( [00:56:16] TOO MANY REQUESTS [00:56:16] You ("185.157.12.102") are issuing too many requests too quickly. [00:56:33] that is all i get when trying to view any page. i was just browsing it like every day. [00:56:59] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [01:00:35] well, i can view it again. i don't know if you did anything or if it fixed itself. 
[01:00:53] i'd still be curious to know how i managed to hit a rate limit [01:01:20] :S [01:02:42] (03CR) 10Smalyshev: [C: 031] "Seems to work ok for tests" [puppet] - 10https://gerrit.wikimedia.org/r/444265 (owner: 10Smalyshev) [01:11:22] (03CR) 10Urbanecm: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444749 (https://phabricator.wikimedia.org/T199161) (owner: 10Urbanecm) [01:12:00] PROBLEM - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is CRITICAL: /api/rest_v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received [01:15:19] RECOVERY - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is OK: All endpoints are healthy [01:21:49] PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200) [01:22:59] RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy [01:31:19] PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (bad seed) timed out before a response was received [01:33:30] RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy [01:34:12] (03PS10) 10Zoranzoki21: Create Publisher namespace in Bengali Wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444664 (https://phabricator.wikimedia.org/T199028) [01:35:19] PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) timed out before a response was received [01:35:51] (03CR) 10Zoranzoki21: [C: 031] "Looks good" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444749 (https://phabricator.wikimedia.org/T199161) (owner: 10Urbanecm) [01:37:59] PROBLEM - recommendation_api endpoints health 
on scb2004 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200) [01:39:49] RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy [01:42:30] RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy [01:45:19] PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200) [01:48:40] RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy [01:49:20] PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) timed out before a response was received [01:53:49] RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy [01:55:20] PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200) [01:57:09] PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200) [01:57:39] RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy [01:58:10] RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy [02:00:59] PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal 
source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200) [02:01:29] PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200) [02:02:09] RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy [02:04:50] RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy [02:05:20] PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200) [02:08:09] PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200) [02:09:50] RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy [02:11:30] RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy [02:13:19] PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) timed out before a response was received [02:17:10] PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200) [02:18:50] RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy [02:22:10] PROBLEM - recommendation_api endpoints 
health on scb2003 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200) [02:25:29] PROBLEM - Misc HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=misc&var-status_type=5 [02:25:30] RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy [02:26:19] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [02:27:19] RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy [02:28:50] PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200) [02:30:30] PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200) [02:34:19] RECOVERY - Misc HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=misc&var-status_type=5 [02:35:00] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] 
https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [02:38:59] RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy [02:43:29] PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200) [02:44:29] !log l10nupdate@deploy1001 ResourceLoader cache refresh completed at Tue Jul 10 02:44:29 UTC 2018 (duration 10m 18s) [02:44:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:46:10] RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy [02:50:39] PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200) [02:57:19] RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy [03:00:39] PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200) [03:02:40] PROBLEM - eventstreams on scb2002 is CRITICAL: connect to address 10.192.48.43 and port 8092: Connection refused [03:03:29] RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy [03:04:00] RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy [03:06:40] PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with 
seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200) [03:10:49] PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200) [03:11:19] RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy [03:11:50] RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy [03:21:09] PROBLEM - puppet last run on scb2006 is CRITICAL: CRITICAL: Puppet has 19 failures. Last run 5 minutes ago with 19 failures. Failed resources (up to 3 shown): Exec[chown /srv/deployment/cpjobqueue for deploy-service],Package[recommendation-api/deploy],Exec[chown /srv/deployment/recommendation-api for deploy-service],Package[mobileapps/deploy] [03:24:50] RECOVERY - eventstreams on scb2002 is OK: HTTP OK: HTTP/1.1 200 OK - 1066 bytes in 0.093 second response time [03:29:57] TOO MANY REQUESTS You ("217.209.178.82") are issuing too many requests too quickly. ---- lol [03:31:40] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1001 is OK: OK: Less than 20.00% above the threshold [500.0] https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [03:43:44] apparently there's a task about the phab rate limit i was complaining about: https://phabricator.wikimedia.org/T198974 [03:43:51] Josve05a: ^ [03:44:44] i was creating a task/ticket with exception crash code :/ All lost [03:46:39] RECOVERY - puppet last run on scb2006 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [04:13:41] uhh, somehow I've hit the rate limit for Phabricator (just normal browsing) [04:13:47] any ideas? 
[04:14:11] When I visit any page I get the error: [04:14:17] TOO MANY REQUESTS [04:14:34] You (IP ADDRESS REDACTED) are issuing too many requests too quickly. [04:14:53] hmm now it works (so far) [04:15:23] T198974 [04:15:24] T198974: Rate-limit is too harsh - https://phabricator.wikimedia.org/T198974 [04:15:57] You are not the first person to run into this tonight, the time period on the rate limit is fairly short though [04:16:59] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 71, down: 1, dormant: 0, excluded: 0, unused: 0 [04:19:20] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 224, down: 1, dormant: 0, excluded: 0, unused: 0 [04:22:06] 08Warning Alert for device cr2-esams.wikimedia.org - Inbound interface errors [04:23:10] ah interesting [04:28:57] !log Deploy schema change on s1 primary master (db1052) T146591 T197891 T196379 [04:29:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:29:02] T196379: Schema change: Add unique index on archive.ar_rev_id - https://phabricator.wikimedia.org/T196379 [04:29:03] T197891: Schema change to drop default from externallinks.el_index_60 - https://phabricator.wikimedia.org/T197891 [04:29:03] T146591: Add a primary key to l10n_cache - https://phabricator.wikimedia.org/T146591 [04:36:43] !log Optimize bgwiki itwiki svwiki zhwiki wbc_entity_usage on db1066 (s2 primary master) - T187521 [04:36:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:36:47] T187521: Optimize recentchanges and wbc_entity_usage table across wikis - https://phabricator.wikimedia.org/T187521 [04:40:51] (03PS1) 10Marostegui: db-eqiad.php: Depool db1084 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444781 (https://phabricator.wikimedia.org/T146591) [04:45:01] 10Operations, 10ops-eqiad, 10DBA: db1069 bad disk - https://phabricator.wikimedia.org/T199056 (10Marostegui) 
05Resolved>03Open Actually it this disk has smart errors too. Was this a re-used or a new disk, @Cmjohnson? ``` PD: 0 Information Enclosure Device ID: 32 Slot Number: 0 Drive's position: DiskGrou... [04:45:21] ACKNOWLEDGEMENT - Device not healthy -SMART- on db1069 is CRITICAL: cluster=mysql device=megaraid,0 instance=db1069:9100 job=node site=eqiad Marostegui T199056 - The acknowledgement expires at: 2018-07-13 04:45:07. https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db1069&var-datasource=eqiad%2520prometheus%252Fops [04:45:37] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1084 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444781 (https://phabricator.wikimedia.org/T146591) (owner: 10Marostegui) [04:47:20] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1084 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444781 (https://phabricator.wikimedia.org/T146591) (owner: 10Marostegui) [04:48:43] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1084 for alter table (duration: 00m 52s) [04:48:45] !log Deploy schema change on db1084 T146591 T197891 T196379 [04:48:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:48:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:48:50] T196379: Schema change: Add unique index on archive.ar_rev_id - https://phabricator.wikimedia.org/T196379 [04:48:51] T197891: Schema change to drop default from externallinks.el_index_60 - https://phabricator.wikimedia.org/T197891 [04:48:51] T146591: Add a primary key to l10n_cache - https://phabricator.wikimedia.org/T146591 [04:50:01] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1084" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444783 [04:53:58] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1084" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444783 (owner: 10Marostegui) [04:55:12] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool 
db1084" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444783 (owner: 10Marostegui) [04:57:28] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1084 after alter table (duration: 00m 50s) [04:57:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:58:41] (03PS1) 10Marostegui: db-eqiad.php: Depool db1097:3314 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444784 (https://phabricator.wikimedia.org/T146591) [04:59:28] !log Optimize frwiki.wbc_entity_usage on s6 codfw, this will generate lag on s6 codfw - T187521 [04:59:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:59:32] T187521: Optimize recentchanges and wbc_entity_usage table across wikis - https://phabricator.wikimedia.org/T187521 [05:01:01] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1097:3314 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444784 (https://phabricator.wikimedia.org/T146591) (owner: 10Marostegui) [05:01:29] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 73, down: 0, dormant: 0, excluded: 0, unused: 0 [05:01:39] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 226, down: 0, dormant: 0, excluded: 0, unused: 0 [05:02:46] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1097:3314 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444784 (https://phabricator.wikimedia.org/T146591) (owner: 10Marostegui) [05:03:56] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1097:3314 for alter table (duration: 00m 51s) [05:03:58] !log Deploy schema change on db1097:3314 T146591 T197891 T196379 [05:03:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:04:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:04:04] T196379: Schema change: Add unique index on archive.ar_rev_id - https://phabricator.wikimedia.org/T196379 [05:04:04] 
T197891: Schema change to drop default from externallinks.el_index_60 - https://phabricator.wikimedia.org/T197891 [05:04:05] T146591: Add a primary key to l10n_cache - https://phabricator.wikimedia.org/T146591 [05:04:17] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1097:3314" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444785 [05:07:05] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1097:3314" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444785 (owner: 10Marostegui) [05:08:52] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1097:3314" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444785 (owner: 10Marostegui) [05:10:49] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1097:3314 after alter table (duration: 00m 50s) [05:10:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:11:16] (03PS1) 10Marostegui: db-eqiad.php: Depool db1103:3314 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444787 (https://phabricator.wikimedia.org/T146591) [05:13:52] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1103:3314 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444787 (https://phabricator.wikimedia.org/T146591) (owner: 10Marostegui) [05:15:05] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1103:3314 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444787 (https://phabricator.wikimedia.org/T146591) (owner: 10Marostegui) [05:15:24] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1103:3314" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444789 [05:16:22] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1103:3314 for alter table (duration: 00m 50s) [05:16:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:16:41] !log Deploy schema change on db1103:3314 T146591 T197891 T196379 [05:16:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:16:48] T196379: 
Schema change: Add unique index on archive.ar_rev_id - https://phabricator.wikimedia.org/T196379 [05:16:48] T197891: Schema change to drop default from externallinks.el_index_60 - https://phabricator.wikimedia.org/T197891 [05:16:49] T146591: Add a primary key to l10n_cache - https://phabricator.wikimedia.org/T146591 [05:17:21] !log Optimize frwiki.wbc_entity_usage on s6 eqiad hosts T187521 [05:17:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:17:24] T187521: Optimize recentchanges and wbc_entity_usage table across wikis - https://phabricator.wikimedia.org/T187521 [05:17:46] (03CR) 10Krinkle: "Might be better to use unprefixed variables for these, or wmg* prefix. That way they won't clash with wg* and/or be accessible through Con" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444632 (owner: 10Thiemo Kreuz (WMDE)) [05:17:55] (03CR) 10Krinkle: [C: 031] Do not leak local $wgWBShared… variables to th eglobal scope [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444632 (owner: 10Thiemo Kreuz (WMDE)) [05:18:50] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1103:3314" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444789 (owner: 10Marostegui) [05:20:03] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1103:3314" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444789 (owner: 10Marostegui) [05:21:08] !log Deploy schema change on db1121 with replication, this will generate lag on s4 labs hosts T146591 T197891 T196379 [05:21:10] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1103:3314 after alter table (duration: 00m 50s) [05:21:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:21:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:22:06] 08̶W̶a̶r̶n̶i̶n̶g Device cr2-esams.wikimedia.org recovered from Inbound interface errors [05:23:03] (03PS1) 10Marostegui: db-eqiad.php: Depool db1081 [mediawiki-config] 
- 10https://gerrit.wikimedia.org/r/444790 (https://phabricator.wikimedia.org/T146591) [05:25:02] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1081 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444790 (https://phabricator.wikimedia.org/T146591) (owner: 10Marostegui) [05:26:12] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1081 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444790 (https://phabricator.wikimedia.org/T146591) (owner: 10Marostegui) [05:27:23] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1081 for alter table (duration: 00m 50s) [05:27:24] !log Deploy schema change on db1081 T146591 T197891 T196379 [05:27:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:27:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:27:30] T196379: Schema change: Add unique index on archive.ar_rev_id - https://phabricator.wikimedia.org/T196379 [05:27:30] T197891: Schema change to drop default from externallinks.el_index_60 - https://phabricator.wikimedia.org/T197891 [05:27:30] T146591: Add a primary key to l10n_cache - https://phabricator.wikimedia.org/T146591 [05:28:03] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1081" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444791 [05:29:28] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1081" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444791 (owner: 10Marostegui) [05:31:04] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1081" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444791 (owner: 10Marostegui) [05:32:10] (03PS1) 10Marostegui: db-eqiad.php: Depool db1091 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444792 (https://phabricator.wikimedia.org/T146591) [05:32:10] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1081 after alter table (duration: 00m 49s) [05:32:12] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:33:57] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1091 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444792 (https://phabricator.wikimedia.org/T146591) (owner: 10Marostegui) [05:35:41] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1091 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444792 (https://phabricator.wikimedia.org/T146591) (owner: 10Marostegui) [05:35:59] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1091" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444793 [05:36:41] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1091 for alter table (duration: 00m 50s) [05:36:42] !log Deploy schema change on db1091 T146591 T197891 T196379 [05:36:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:36:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:36:48] T196379: Schema change: Add unique index on archive.ar_rev_id - https://phabricator.wikimedia.org/T196379 [05:36:48] T197891: Schema change to drop default from externallinks.el_index_60 - https://phabricator.wikimedia.org/T197891 [05:36:48] T146591: Add a primary key to l10n_cache - https://phabricator.wikimedia.org/T146591 [05:37:50] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1091" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444793 (owner: 10Marostegui) [05:39:33] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1091" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444793 (owner: 10Marostegui) [05:39:59] !log Deploy schema change on s4 primary master (db1068) T146591 T197891 T196379 [05:40:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:40:44] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1091 after alter table (duration: 00m 50s) [05:40:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:00:29] 
RECOVERY - Check systemd state on notebook1004 is OK: OK - running: The system is fully operational [06:02:16] !log Deploy schema change on codfw s8 master (db2045) with replication, this will generate lag on s8 codfw T146591 T197891 T196379 [06:02:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:02:22] T196379: Schema change: Add unique index on archive.ar_rev_id - https://phabricator.wikimedia.org/T196379 [06:02:22] T197891: Schema change to drop default from externallinks.el_index_60 - https://phabricator.wikimedia.org/T197891 [06:02:22] T146591: Add a primary key to l10n_cache - https://phabricator.wikimedia.org/T146591 [06:04:29] PROBLEM - puppet last run on dbstore1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [06:06:35] !log Deploy schema change on dbstore1002:s8 T146591 T197891 T196379 [06:06:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:07:36] (03PS1) 10Elukey: Add interface::add_ip6_mapped to stat* hosts [puppet] - 10https://gerrit.wikimedia.org/r/444795 (https://phabricator.wikimedia.org/T199180) [06:07:44] poor dbstore1002, nobody leaves it alone :D [06:08:33] it is good for it! [06:12:01] hahahaha [06:12:13] new indexes, PKs... 
[06:12:18] all good for our dbstore1002 [06:16:40] (03CR) 10Elukey: [C: 032] Add interface::add_ip6_mapped to stat* hosts [puppet] - 10https://gerrit.wikimedia.org/r/444795 (https://phabricator.wikimedia.org/T199180) (owner: 10Elukey) [06:17:43] !log Deploy schema change on s8 primary master (db1071) T146591 T197891 T196379 [06:17:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:17:48] T196379: Schema change: Add unique index on archive.ar_rev_id - https://phabricator.wikimedia.org/T196379 [06:17:48] T197891: Schema change to drop default from externallinks.el_index_60 - https://phabricator.wikimedia.org/T197891 [06:17:49] T146591: Add a primary key to l10n_cache - https://phabricator.wikimedia.org/T146591 [06:21:00] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [06:23:50] !log Deploy schema change on codfw s7 master (db2040) with replication, this will generate lag on s7 codfw T146591 T197891 T196379 [06:23:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:23:55] T196379: Schema change: Add unique index on archive.ar_rev_id - https://phabricator.wikimedia.org/T196379 [06:23:56] T197891: Schema change to drop default from externallinks.el_index_60 - https://phabricator.wikimedia.org/T197891 [06:23:56] T146591: Add a primary key to l10n_cache - https://phabricator.wikimedia.org/T146591 [06:29:30] PROBLEM - puppet last run on labvirt1017 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 3 minutes ago with 2 failures. 
Failed resources (up to 3 shown): File[/usr/lib/nagios/plugins/check_sysctl],File[/usr/local/lib/nagios/plugins/check_long_procs] [06:32:37] !log Optimize wbc_entity_usage on arwiki cawiki huwiki rowiki ukwiki on s7 codfw master (db2040) with replication, this will generate lag on s7 codfw - T187521 [06:32:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:32:40] T187521: Optimize recentchanges and wbc_entity_usage table across wikis - https://phabricator.wikimedia.org/T187521 [06:33:50] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1001 is OK: OK: Less than 20.00% above the threshold [500.0] https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [06:37:35] (03PS1) 10Elukey: Add interface::add_ip6_mapped to analytics100[2,3] [puppet] - 10https://gerrit.wikimedia.org/r/444796 (https://phabricator.wikimedia.org/T199180) [06:39:03] (03CR) 10Elukey: [C: 032] Add interface::add_ip6_mapped to analytics100[2,3] [puppet] - 10https://gerrit.wikimedia.org/r/444796 (https://phabricator.wikimedia.org/T199180) (owner: 10Elukey) [06:39:31] RECOVERY - puppet last run on labvirt1017 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [06:49:18] (03PS1) 10Jcrespo: mariadb package: Update stretch package to the latest version [software] - 10https://gerrit.wikimedia.org/r/444797 [06:49:48] (03CR) 10Jcrespo: [V: 032 C: 032] mariadb package: Update stretch package to the latest version [software] - 10https://gerrit.wikimedia.org/r/444797 (owner: 10Jcrespo) [06:53:19] !log deployed rPHEX03173dd0097451f60faa3a1705abee58a9fe4c5f [06:53:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:53:44] (03PS1) 10Jcrespo: Remove .hosts files, update tendril instead [software] - 10https://gerrit.wikimedia.org/r/444798 [06:54:35] (03PS1) 10Elukey: Add interface::add_ip6_mapped to analytics1001 [puppet] - 
10https://gerrit.wikimedia.org/r/444799 (https://phabricator.wikimedia.org/T199180) [06:56:24] (03CR) 10Elukey: [C: 032] Add interface::add_ip6_mapped to analytics1001 [puppet] - 10https://gerrit.wikimedia.org/r/444799 (https://phabricator.wikimedia.org/T199180) (owner: 10Elukey) [06:58:03] (03CR) 10Marostegui: [C: 031] Remove .hosts files, update tendril instead [software] - 10https://gerrit.wikimedia.org/r/444798 (owner: 10Jcrespo) [06:58:16] (03PS3) 10Giuseppe Lavagetto: mediawiki: start splitting up remnant.conf [puppet] - 10https://gerrit.wikimedia.org/r/444184 [06:58:18] (03PS3) 10Giuseppe Lavagetto: mediawiki: unify the small private wikis definitions [puppet] - 10https://gerrit.wikimedia.org/r/444185 [06:58:20] (03PS3) 10Giuseppe Lavagetto: mediawiki: move private wikis to a separate virtual host [puppet] - 10https://gerrit.wikimedia.org/r/444186 [06:58:22] (03PS3) 10Giuseppe Lavagetto: mediawiki: split all of remnant.conf into individual vhosts [puppet] - 10https://gerrit.wikimedia.org/r/444187 [06:58:24] (03PS2) 10Giuseppe Lavagetto: mediawiki_test: split wikimania.conf [puppet] - 10https://gerrit.wikimedia.org/r/444240 [06:58:27] (03PS2) 10Giuseppe Lavagetto: mediawiki_test: complete the transition to one wiki per template. [puppet] - 10https://gerrit.wikimedia.org/r/444241 [07:00:01] (03CR) 10Jcrespo: "I am setting up the backups for the zarceillo database tables before I can deploy it." [software] - 10https://gerrit.wikimedia.org/r/444798 (owner: 10Jcrespo) [07:00:42] (03CR) 10jerkins-bot: [V: 04-1] mediawiki_test: complete the transition to one wiki per template. [puppet] - 10https://gerrit.wikimedia.org/r/444241 (owner: 10Giuseppe Lavagetto) [07:04:30] (03PS1) 10Jcrespo: mariadb: Setup backups for zarcillo database on tendril [puppet] - 10https://gerrit.wikimedia.org/r/444800 (https://phabricator.wikimedia.org/T198987) [07:07:47] (03CR) 10Jcrespo: "Let's deploy this and let's take at least one correct backup before deploying gerrit:444798." 
[puppet] - 10https://gerrit.wikimedia.org/r/444800 (https://phabricator.wikimedia.org/T198987) (owner: 10Jcrespo) [07:07:54] (03PS1) 10Muehlenhoff: Update account date for mpany, remove two fund raising contractors [puppet] - 10https://gerrit.wikimedia.org/r/444801 [07:08:40] PROBLEM - puppet last run on analytics1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[hadoop-hdfs-zkfc-init] [07:09:31] this is me, fixing in a sec --^ [07:09:31] (03PS1) 10Elukey: network::constants: update ip6 addresses for hadoop master nodes [puppet] - 10https://gerrit.wikimedia.org/r/444802 (https://phabricator.wikimedia.org/T199180) [07:10:25] (03CR) 10Elukey: [C: 032] network::constants: update ip6 addresses for hadoop master nodes [puppet] - 10https://gerrit.wikimedia.org/r/444802 (https://phabricator.wikimedia.org/T199180) (owner: 10Elukey) [07:12:23] (03CR) 10Marostegui: [C: 031] mariadb: Setup backups for zarcillo database on tendril [puppet] - 10https://gerrit.wikimedia.org/r/444800 (https://phabricator.wikimedia.org/T198987) (owner: 10Jcrespo) [07:13:34] (03PS4) 10Giuseppe Lavagetto: mediawiki: start splitting up remnant.conf [puppet] - 10https://gerrit.wikimedia.org/r/444184 [07:13:53] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] mediawiki: start splitting up remnant.conf [puppet] - 10https://gerrit.wikimedia.org/r/444184 (owner: 10Giuseppe Lavagetto) [07:16:02] (03PS2) 10Jcrespo: mariadb: Setup backups for zarcillo database on tendril [puppet] - 10https://gerrit.wikimedia.org/r/444800 (https://phabricator.wikimedia.org/T198987) [07:16:28] (03CR) 10Jcrespo: [C: 032] mariadb: Setup backups for zarcillo database on tendril [puppet] - 10https://gerrit.wikimedia.org/r/444800 (https://phabricator.wikimedia.org/T198987) (owner: 10Jcrespo) [07:18:34] 04Critical Alert for device asw2-a-eqiad.mgmt.eqiad.wmnet - Critical syslog messages [07:18:50] RECOVERY - puppet last run on analytics1001 is OK: OK: Puppet 
is currently enabled, last run 21 seconds ago with 0 failures [07:19:38] (03PS1) 10Muehlenhoff: Add Debian conditional for prometheus-mysqld-exporter service_auto_restart [puppet] - 10https://gerrit.wikimedia.org/r/444804 [07:22:40] so show log messages on asw2-a returns stuff like [07:22:48] Jul 10 07:13:37 asw2-a-eqiad fpc8 Rear QSFP+ PIC Chan# 3: Tx laser fault cleared [07:22:51] Jul 10 07:13:37 asw2-a-eqiad fpc8 Rear QSFP+ PIC Chan# 3: Rx loss cleared [07:24:00] akosiaris, paravoid --^ [07:24:13] it is probably nothing but better to triple check :) [07:25:21] elukey: you logged in after seeing the critical message here, right? [07:26:25] yep yep, I am aware of the issue that happened to Chase [07:26:29] ack :) [07:26:55] https://librenms.wikimedia.org/device/device=160/tab=health/metric=storage/ would probably need a check too (just seen it passing by) [07:27:27] it's kinda weird.. the alert event triggered at 07:18 according to https://librenms.wikimedia.org/device/device=160/tab=logs/section=eventlog/ [07:27:43] 2018-07-10 07:18:01 System Issued critical alert for rule 'Critical syslog messages' to transport 'irc' [07:28:10] it might collect stuff from prev mins and then if something looks weird it alarms? 
[07:28:18] but the last critical is from 07:13 [07:28:34] 04̶C̶r̶i̶t̶i̶c̶a̶l Device asw2-a-eqiad.mgmt.eqiad.wmnet recovered from Critical syslog messages [07:28:39] elukey: probably [07:28:57] https://librenms.wikimedia.org/alert-rules/ [07:29:05] %syslog.timestamp >= %macros.past_5m && %syslog.priority = "crit" && %syslog.msg !~ "preauth" && %syslog.msg !~ "ipc_version_icu_bypass" [07:29:18] vgutierrez: --^ [07:29:43] all right seems nothing happened, will wait for the experts to confirm :) [07:29:47] Cc: XioNoX [07:29:58] (brb) [07:33:13] (03CR) 10Volans: Remove .hosts files, update tendril instead (031 comment) [software] - 10https://gerrit.wikimedia.org/r/444798 (owner: 10Jcrespo) [07:33:58] !log deploying fix for phabricator userpage having hard-coded my username [07:34:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:34:45] (03PS2) 10Muehlenhoff: Add Debian conditional for prometheus-mysqld-exporter service_auto_restart [puppet] - 10https://gerrit.wikimedia.org/r/444804 [07:36:03] (03CR) 10Muehlenhoff: [C: 032] Add Debian conditional for prometheus-mysqld-exporter service_auto_restart [puppet] - 10https://gerrit.wikimedia.org/r/444804 (owner: 10Muehlenhoff) [07:36:31] 10Operations, 10Phabricator: Getting 'TOO MANY REQUESTS' error - https://phabricator.wikimedia.org/T199184 (10MarcoAurelio) I guess this is implemented via some puppet config. Please revert if I am mistaken. 
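The timing question raised between 07:27 and 07:29 (alert issued at 07:18:01, last critical message logged at 07:13:37) can be checked with a small sketch of the rule quoted at 07:29:05. This is only an illustration under stated assumptions: `%macros.past_5m` is taken to mean "evaluation time minus five minutes", `!~` is approximated by a plain substring test, and the 07:13:37 message is assumed to carry priority `crit`. The function name `rule_matches` is hypothetical and is not LibreNMS code.

```python
from datetime import datetime, timedelta

# Substrings the quoted rule excludes via `%syslog.msg !~ ...`
# (approximated here as substring checks; that is an assumption).
EXCLUDED_SUBSTRINGS = ("preauth", "ipc_version_icu_bypass")


def rule_matches(msg_time, priority, msg, now, window=timedelta(minutes=5)):
    """Hypothetical re-statement of the quoted 'Critical syslog messages' rule."""
    recent = msg_time >= now - window          # %syslog.timestamp >= %macros.past_5m
    critical = priority == "crit"              # %syslog.priority = "crit"
    allowed = not any(s in msg for s in EXCLUDED_SUBSTRINGS)
    return recent and critical and allowed


evaluated_at = datetime(2018, 7, 10, 7, 18, 1)   # alert issued, per the event log
logged_at = datetime(2018, 7, 10, 7, 13, 37)     # syslog line quoted at 07:22:48
message = "fpc8 Rear QSFP+ PIC Chan# 3: Tx laser fault cleared"

# Still inside the five-minute window when the rule was evaluated.
still_in_window = rule_matches(logged_at, "crit", message, evaluated_at)
# Five minutes later the same message has aged out of the window.
out_of_window = rule_matches(
    logged_at, "crit", message, evaluated_at + timedelta(minutes=5)
)
```

Under these assumptions a message from 07:13:37 still falls inside the window at 07:18:01, which would be consistent with the guess that the check looks back over the previous minutes.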
[07:39:43] (03PS2) 10Muehlenhoff: Update account date for mpany, remove two fund raising contractors [puppet] - 10https://gerrit.wikimedia.org/r/444801 [07:41:09] RECOVERY - puppet last run on dbstore1002 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [07:42:51] (03PS2) 10Vgutierrez: [WIP] get rid of openssl CLI usage [software/certcentral] - 10https://gerrit.wikimedia.org/r/444631 [07:43:23] (03CR) 10Muehlenhoff: [C: 032] Update account date for mpany, remove two fund raising contractors [puppet] - 10https://gerrit.wikimedia.org/r/444801 (owner: 10Muehlenhoff) [07:43:50] (03CR) 10jerkins-bot: [V: 04-1] [WIP] get rid of openssl CLI usage [software/certcentral] - 10https://gerrit.wikimedia.org/r/444631 (owner: 10Vgutierrez) [07:44:20] (03CR) 10Vgutierrez: [WIP] get rid of openssl CLI usage (034 comments) [software/certcentral] - 10https://gerrit.wikimedia.org/r/444631 (owner: 10Vgutierrez) [07:46:39] (03PS1) 10Elukey: Add interface::add_ip6_mapped to all the Hadoop worker nodes [puppet] - 10https://gerrit.wikimedia.org/r/444805 (https://phabricator.wikimedia.org/T199180) [07:47:21] (03PS2) 10Elukey: Add interface::add_ip6_mapped to all the Hadoop worker nodes [puppet] - 10https://gerrit.wikimedia.org/r/444805 (https://phabricator.wikimedia.org/T199180) [07:49:54] (03CR) 10Elukey: [C: 032] Add interface::add_ip6_mapped to all the Hadoop worker nodes [puppet] - 10https://gerrit.wikimedia.org/r/444805 (https://phabricator.wikimedia.org/T199180) (owner: 10Elukey) [07:50:26] moritzm: can I merge yours too? [07:51:05] (03CR) 10Jcrespo: Remove .hosts files, update tendril instead (031 comment) [software] - 10https://gerrit.wikimedia.org/r/444798 (owner: 10Jcrespo) [07:54:54] ah, yes. please go ahead [07:58:09] !log installing ntp security updates on trusty (may trigger some Icinga warnings about clocks, these recover after a while) [07:58:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:58:13] ack! 
[08:03:03] 10Operations, 10Phabricator: Getting 'TOO MANY REQUESTS' error - https://phabricator.wikimedia.org/T199184 (10Jc86035) [08:03:30] 10Operations, 10Phabricator: Rate-limit is too harsh - https://phabricator.wikimedia.org/T198974 (10Jc86035) [08:04:49] (03PS1) 10Joal: role::common::aqs: update druid mediawiki's datasource [puppet] - 10https://gerrit.wikimedia.org/r/444807 [08:04:58] elukey: --^ [08:05:02] if you have a minute [08:05:10] 10Operations, 10Phabricator: Rate-limit is too harsh - https://phabricator.wikimedia.org/T198974 (10Jc86035) [08:07:03] 10Operations, 10Phabricator: Rate-limit is too harsh and affects human users - https://phabricator.wikimedia.org/T198974 (10Jc86035) [08:08:00] (03CR) 10Elukey: [C: 032] role::common::aqs: update druid mediawiki's datasource [puppet] - 10https://gerrit.wikimedia.org/r/444807 (owner: 10Joal) [08:08:26] (03PS1) 10Jcrespo: wmf_root_clients: Productionize mysql.py wrapper [puppet] - 10https://gerrit.wikimedia.org/r/444808 [08:08:38] !log installing openslp security updates on trusty [08:08:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:08:51] (03CR) 10Jcrespo: "Happy now?" 
[puppet] - 10https://gerrit.wikimedia.org/r/444808 (owner: 10Jcrespo) [08:09:05] !log disable puppet on hosts running cassandra before merging 444247 and 443114 [08:09:06] (03CR) 10jerkins-bot: [V: 04-1] wmf_root_clients: Productionize mysql.py wrapper [puppet] - 10https://gerrit.wikimedia.org/r/444808 (owner: 10Jcrespo) [08:09:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:09:25] (03PS6) 10Filippo Giunchedi: restbase: cleanup remaining detritus from storage transition [puppet] - 10https://gerrit.wikimedia.org/r/443114 (https://phabricator.wikimedia.org/T191659) (owner: 10Eevans) [08:09:49] 10Operations, 10Phabricator: Rate-limit is too harsh and affects human users - https://phabricator.wikimedia.org/T198974 (10Jc86035) [08:09:51] (03CR) 10Filippo Giunchedi: [C: 032] restbase: cleanup remaining detritus from storage transition [puppet] - 10https://gerrit.wikimedia.org/r/443114 (https://phabricator.wikimedia.org/T191659) (owner: 10Eevans) [08:09:58] joal: merged but before rolling out we'd need to wait a sec for Filippo rolling out --^ [08:10:36] elukey: hah! feel free to reenable puppet where you want btw, I disabled as a precaution but it is a noop on e.g. aqs [08:10:54] ah okok! 
Will run it there and check then :) [08:11:08] (03CR) 10Filippo Giunchedi: [C: 032] RESTBase: Disable cassandra-metrics-collector [puppet] - 10https://gerrit.wikimedia.org/r/444247 (https://phabricator.wikimedia.org/T186567) (owner: 10Mobrovac) [08:11:17] (03PS2) 10Filippo Giunchedi: RESTBase: Disable cassandra-metrics-collector [puppet] - 10https://gerrit.wikimedia.org/r/444247 (https://phabricator.wikimedia.org/T186567) (owner: 10Mobrovac) [08:11:36] Thanks elukey and godog :) [08:11:53] 10Operations, 10Phabricator: Rate-limit is too harsh and affects human users - https://phabricator.wikimedia.org/T198974 (10Jc86035) [08:12:46] (03PS2) 10Jcrespo: wmf_root_clients: Productionize mysql.py wrapper [puppet] - 10https://gerrit.wikimedia.org/r/444808 [08:14:16] joal: aqs1004 running with the new config, if ok I'll complete the restarts [08:15:01] elukey: 1 min for me to check please :) [08:15:32] (03PS2) 10Jcrespo: Remove .hosts files, update tendril instead [software] - 10https://gerrit.wikimedia.org/r/444798 [08:16:39] PROBLEM - Check systemd state on restbase2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[08:16:46] that's me ^ [08:16:52] volans: please review https://gerrit.wikimedia.org/r/444808 [08:17:06] (03CR) 10jenkins-bot: Cleanup: Remove wgWikiEditorFeatures, dropped in master in Ia1eb91d2d [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444655 (owner: 10Jforrester) [08:17:08] (03CR) 10jenkins-bot: Cleanup: Remove Beta Cluster use of wikieditor-preview preference, no longer around [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444656 (owner: 10Jforrester) [08:17:12] (03CR) 10jenkins-bot: Cleanup: Stop setting wmgVisualEditorNonAccountEnableProportion to false [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444657 (owner: 10Jforrester) [08:17:14] (03CR) 10jenkins-bot: Cleanup: Stop setting wgVisualEditorNonAccountEnableProportion, dropped in master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444658 (owner: 10Jforrester) [08:17:16] (03CR) 10jenkins-bot: Cleanup: Stop setting wgTmhEnableMp3Uploads, dropped ages ago [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444659 (owner: 10Jforrester) [08:17:18] (03CR) 10jenkins-bot: Cleanup: Stop setting wmgTmhEnableMp3Uploads, default true everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444660 (owner: 10Jforrester) [08:17:20] (03CR) 10jenkins-bot: Cleanup: No need for officewiki-specific upload for MP3s any more [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444661 (owner: 10Jforrester) [08:17:22] (03CR) 10jenkins-bot: Cleanup: Stop trying to set wgLicenseURL, never read [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444663 (https://phabricator.wikimedia.org/T154069) (owner: 10Jforrester) [08:17:24] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1084 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444781 (https://phabricator.wikimedia.org/T146591) (owner: 10Marostegui) [08:17:26] (03PS1) 1020after4: Phabricator: Double the rate limit and connection limit [puppet] - 10https://gerrit.wikimedia.org/r/444810 (https://phabricator.wikimedia.org/T198974) 
[08:17:28] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1084" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444783 (owner: 10Marostegui) [08:17:30] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1097:3314 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444784 (https://phabricator.wikimedia.org/T146591) (owner: 10Marostegui) [08:17:32] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1097:3314" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444785 (owner: 10Marostegui) [08:17:34] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1103:3314 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444787 (https://phabricator.wikimedia.org/T146591) (owner: 10Marostegui) [08:17:36] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1103:3314" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444789 (owner: 10Marostegui) [08:17:38] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1081 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444790 (https://phabricator.wikimedia.org/T146591) (owner: 10Marostegui) [08:17:40] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1081" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444791 (owner: 10Marostegui) [08:17:42] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1091 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444792 (https://phabricator.wikimedia.org/T146591) (owner: 10Marostegui) [08:17:44] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1091" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444793 (owner: 10Marostegui) [08:17:46] (03CR) 10jenkins-bot: Stop loading the MwEmbedSupport extension, part I [mediawiki-config] - 10https://gerrit.wikimedia.org/r/441519 (owner: 10Jforrester) [08:17:48] (03CR) 10jenkins-bot: Remove unnecessary code: $wgTidyConfig can never be null [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444650 (owner: 10C. 
Scott Ananian) [08:17:49] RECOVERY - Check systemd state on restbase2001 is OK: OK - running: The system is fully operational [08:17:50] (03CR) 10jenkins-bot: Stop loading the MwEmbedSupport extension, part II [mediawiki-config] - 10https://gerrit.wikimedia.org/r/441518 (owner: 10Jforrester) [08:17:52] (03CR) 10jenkins-bot: Stop loading the MwEmbedSupport extension, part III [mediawiki-config] - 10https://gerrit.wikimedia.org/r/441520 (owner: 10Jforrester) [08:17:54] (03CR) 10jenkins-bot: Stop loading the MwEmbedSupport extension, part IV [mediawiki-config] - 10https://gerrit.wikimedia.org/r/441521 (owner: 10Jforrester) [08:19:58] elukey: Success! [08:20:06] elukey: we can continue the rollout [08:20:48] joal: I'll add ip6 addresses too [08:21:54] (03PS1) 10Elukey: Add interface::add_ip6_mapped to aqs nodes [puppet] - 10https://gerrit.wikimedia.org/r/444811 (https://phabricator.wikimedia.org/T199180) [08:22:41] (03PS1) 10Muehlenhoff: Add library hint for libjpeg-turbo [puppet] - 10https://gerrit.wikimedia.org/r/444812 [08:24:59] PROBLEM - Check systemd state on restbase2010 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[08:25:03] (03CR) 10Elukey: [C: 032] Add interface::add_ip6_mapped to aqs nodes [puppet] - 10https://gerrit.wikimedia.org/r/444811 (https://phabricator.wikimedia.org/T199180) (owner: 10Elukey) [08:26:07] (03PS2) 10Muehlenhoff: Add library hint for libjpeg-turbo [puppet] - 10https://gerrit.wikimedia.org/r/444812 [08:26:56] (03CR) 10Muehlenhoff: [C: 032] Add library hint for libjpeg-turbo [puppet] - 10https://gerrit.wikimedia.org/r/444812 (owner: 10Muehlenhoff) [08:32:23] (03CR) 10Jcrespo: [C: 032] wmf_root_clients: Productionize mysql.py wrapper [puppet] - 10https://gerrit.wikimedia.org/r/444808 (owner: 10Jcrespo) [08:32:31] (03PS3) 10Jcrespo: wmf_root_clients: Productionize mysql.py wrapper [puppet] - 10https://gerrit.wikimedia.org/r/444808 [08:34:44] !log rolling restart of AQS to apply the new config [08:34:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:35:59] RECOVERY - Check systemd state on restbase2010 is OK: OK - running: The system is fully operational [08:37:24] !log drain and restart cassandra-a on restbase2001 to test a restart [08:37:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:38:35] !log installing libjpeg-turbo security updates on trusty [08:38:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:40:10] PROBLEM - cassandra-a SSL 10.192.16.162:7001 on restbase2001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [08:40:19] PROBLEM - cassandra-a CQL 10.192.16.162:9042 on restbase2001 is CRITICAL: connect to address 10.192.16.162 and port 9042: Connection refused [08:40:52] (03CR) 10Jcrespo: [C: 032] Remove .hosts files, update tendril instead [software] - 10https://gerrit.wikimedia.org/r/444798 (owner: 10Jcrespo) [08:42:55] expected ^ [08:45:06] should be recovering shortly [08:45:13] cool [08:45:27] godog: i will also restart rb on that node to check [08:46:08] mobrovac: ok! 
I skipped rb because iirc the change didn't affect anything about it [08:46:24] true, but let's be sure :) [08:46:59] RECOVERY - cassandra-a SSL 10.192.16.162:7001 on restbase2001 is OK: SSL OK - Certificate restbase2001-a valid until 2020-06-24 13:01:24 +0000 (expires in 715 days) [08:47:00] RECOVERY - cassandra-a CQL 10.192.16.162:9042 on restbase2001 is OK: TCP OK - 0.036 second response time on 10.192.16.162 port 9042 [08:47:33] ok all good for rb as well [08:51:06] \o/ [08:53:00] PROBLEM - Check systemd state on restbase-dev1005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [08:53:59] (03PS1) 10Muehlenhoff: Add library hints for libsoup2.4 [puppet] - 10https://gerrit.wikimedia.org/r/444814 [08:58:07] checking ^ [08:59:00] (03PS2) 10Muehlenhoff: Add library hints for libsoup2.4 [puppet] - 10https://gerrit.wikimedia.org/r/444814 [08:59:54] (03CR) 10Muehlenhoff: [C: 032] Add library hints for libsoup2.4 [puppet] - 10https://gerrit.wikimedia.org/r/444814 (owner: 10Muehlenhoff) [08:59:55] mobrovac: ah yeah that's because of a missing 'systemctl reset-failed cassandra-metrics-collector' I believe [09:00:03] I'll run it on the rest of the dev cluster [09:00:10] ah ok :) [09:00:23] indeed it is [09:00:33] cassandra-metrics-collector.service not-found failed failed [09:00:49] RECOVERY - Check systemd state on restbase-dev1005 is OK: OK - running: The system is fully operational [09:01:11] 10Operations, 10ops-eqiad: Relabel labvirt1021.eqiad.wmnet as cloudvirt1021.eqiad.wmnet - https://phabricator.wikimedia.org/T199132 (10aborrero) [09:02:30] 10Operations, 10Pybal, 10Traffic, 10Patch-For-Review: Tune systemd journal rate limiting for PyBal - https://phabricator.wikimedia.org/T189290 (10ema) >>! In T189290#4053740, @ema wrote: > It would have been much more useful to get such messages into `journalctl -u pybal.service`'s output instead, and I do... 
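The "Check systemd state" alerts on the restbase hosts were traced at 08:59:55 to a removed unit still marked as failed, with the evidence quoted at 09:00:33. A rough sketch of spotting such leftovers follows. It assumes input shaped like the `UNIT LOAD ACTIVE SUB` columns of a `systemctl list-units` listing; the helper name `stale_failed_units` is hypothetical, and the sketch only builds the `systemctl reset-failed` command strings without running them.

```python
def stale_failed_units(listing):
    """Return units whose unit file is gone (LOAD=not-found) but that are still failed."""
    stale = []
    for line in listing.splitlines():
        # systemctl may prefix failed units with a bullet; drop it before splitting.
        fields = line.replace("●", " ").split()
        if len(fields) < 4:
            continue
        unit, load, active = fields[0], fields[1], fields[2]
        if load == "not-found" and active == "failed":
            stale.append(unit)
    return stale


# The line quoted in the channel at 09:00:33.
sample = "cassandra-metrics-collector.service not-found failed failed"

# Commands an operator could review before running them by hand.
commands = [f"systemctl reset-failed {unit}" for unit in stale_failed_units(sample)]
```

Building the command list rather than executing it keeps the sketch side-effect free; whether to clear the state automatically is a separate decision.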
[09:05:04] !log installing libsoup security updates on jessie/stretch [09:05:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:06:39] PROBLEM - Check systemd state on restbase-dev1006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [09:08:59] PROBLEM - Check systemd state on restbase-dev1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [09:09:45] (03PS1) 10Elukey: Add interface::add_ip6_mapped to druid, bohrium, thorium and meinerium [puppet] - 10https://gerrit.wikimedia.org/r/444819 (https://phabricator.wikimedia.org/T199180) [09:11:03] 10Operations, 10Analytics, 10Analytics-Kanban, 10netops, 10Patch-For-Review: Review analytics-in4 rules on cr1/cr2 eqiad - https://phabricator.wikimedia.org/T198623 (10elukey) Last changed applied by Arzhel, including merging common-infrastructure4 to analytics-in4 [09:11:27] (03PS4) 10Arturo Borrero Gonzalez: cloudvps: rename labvirt1021.eqiad.wmnet to cloudvirt1021.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/444584 (https://phabricator.wikimedia.org/T199107) [09:11:29] (03CR) 10Elukey: [C: 032] Add interface::add_ip6_mapped to druid, bohrium, thorium and meinerium [puppet] - 10https://gerrit.wikimedia.org/r/444819 (https://phabricator.wikimedia.org/T199180) (owner: 10Elukey) [09:12:19] RECOVERY - Check systemd state on restbase-dev1004 is OK: OK - running: The system is fully operational [09:13:19] (03PS4) 10Arturo Borrero Gonzalez: cloudvps: reimage and rename labvirt1021 to cloudvirt1021 [puppet] - 10https://gerrit.wikimedia.org/r/444581 (https://phabricator.wikimedia.org/T199107) [09:13:54] (03CR) 10Arturo Borrero Gonzalez: [C: 032] cloudvps: rename labvirt1021.eqiad.wmnet to cloudvirt1021.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/444584 (https://phabricator.wikimedia.org/T199107) (owner: 10Arturo Borrero Gonzalez) [09:14:26] 10Operations, 10ops-esams, 10Epic: Remove all decommissioned 
hardware - https://phabricator.wikimedia.org/T184063 (10mark) [09:14:35] (03CR) 10Arturo Borrero Gonzalez: [C: 032] cloudvps: reimage and rename labvirt1021 to cloudvirt1021 [puppet] - 10https://gerrit.wikimedia.org/r/444581 (https://phabricator.wikimedia.org/T199107) (owner: 10Arturo Borrero Gonzalez) [09:17:42] (03PS3) 10ArielGlenn: quick script to show runtimes of dump jobs [dumps] - 10https://gerrit.wikimedia.org/r/444603 (https://phabricator.wikimedia.org/T199117) [09:25:28] !log swift eqiad-prod add ms-be1036 back gradually - T196873 [09:25:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:25:32] T196873: ms-be1036 in power off status, not responsive to power on commands - https://phabricator.wikimedia.org/T196873 [09:27:10] PROBLEM - Check systemd state on stat1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [09:28:33] working on --^ [09:29:03] there seems to be some nagios check_disk processes in interruptible sleep, I think probably due to nfs or something similar [09:30:00] godog: applied everywhere? [09:31:25] mobrovac: yup should be rolled out everywhere now [09:33:26] RECOVERY - Check systemd state on stat1004 is OK: OK - running: The system is fully operational [09:33:30] kk thnx! 
[09:34:36] !log forced umount of /mnt/hdfs on stat1004, several processes hang for it (causing load) and transport not connected [09:34:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:38:49] (03PS1) 10Elukey: profile::analytics::cluster::client: enable by default /mnt/hdfs check [puppet] - 10https://gerrit.wikimedia.org/r/444825 [09:40:08] (03PS2) 10Elukey: profile::analytics::cluster::client: enable by default /mnt/hdfs check [puppet] - 10https://gerrit.wikimedia.org/r/444825 [09:41:22] (03CR) 10Elukey: [C: 032] profile::analytics::cluster::client: enable by default /mnt/hdfs check [puppet] - 10https://gerrit.wikimedia.org/r/444825 (owner: 10Elukey) [09:42:26] PROBLEM - swift-object-server on ms-be1040 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [09:42:26] PROBLEM - swift-account-server on ms-be1040 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [09:42:26] PROBLEM - swift-container-server on ms-be1040 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [09:42:35] PROBLEM - swift-account-auditor on ms-be1040 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [09:42:35] PROBLEM - swift-object-auditor on ms-be1040 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [09:42:45] PROBLEM - swift-container-updater on ms-be1040 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-updater [09:42:46] (03PS1) 10Ema: varnish: improve wm_common_directors_init readability [puppet] - 10https://gerrit.wikimedia.org/r/444827 [09:42:46] PROBLEM - swift-container-replicator on ms-be1040 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [09:42:55] PROBLEM - 
swift-object-replicator on ms-be1040 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [09:42:56] PROBLEM - swift-account-reaper on ms-be1040 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [09:42:56] PROBLEM - swift-container-auditor on ms-be1040 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [09:43:05] PROBLEM - swift-object-updater on ms-be1040 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-updater [09:43:05] PROBLEM - swift-account-replicator on ms-be1040 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [09:43:49] oops, that's me [09:43:52] sorry about the spam [09:44:37] (03PS1) 10Muehlenhoff: Add library hint for exiv2 [puppet] - 10https://gerrit.wikimedia.org/r/444828 [09:45:38] (03PS2) 10Muehlenhoff: Add library hint for exiv2 [puppet] - 10https://gerrit.wikimedia.org/r/444828 [09:48:23] moritzm: I was wondering if we could use apt-cache depends in debdeploy to detect the related libraries, allowing to reduce a lot (if not all) the libraries hints [09:49:59] (03CR) 10Muehlenhoff: [C: 032] Add library hint for exiv2 [puppet] - 10https://gerrit.wikimedia.org/r/444828 (owner: 10Muehlenhoff) [09:51:14] 10Operations, 10media-storage: Some swift filesystems reporting negative disk usage - https://phabricator.wikimedia.org/T199198 (10fgiunchedi) [09:51:15] volans: the query_deps command uses apt-cache rdepends, but for detecting restarts we need the specific sonames. 
ideally this would be part of dpkg meta data, I want to propose support for this upstream, but needs more work/thought [09:51:31] 10Operations, 10media-storage, 10User-fgiunchedi: Some swift filesystems reporting negative disk usage - https://phabricator.wikimedia.org/T199198 (10fgiunchedi) [09:52:17] and when there's a public release of debdeploy I'll simply ship it in the package as a starting point for others (most updates already reuse existing hints by now anyway) [09:52:26] moritzm: yeah, what I'm referring to is the 'exiv2 = libexiv2', that could be gathered by depends AFAIK, so for the simple cases it should work [09:52:34] no, that' [09:52:36] RECOVERY - swift-account-reaper on ms-be1040 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [09:52:36] RECOVERY - swift-container-auditor on ms-be1040 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [09:52:45] RECOVERY - swift-object-updater on ms-be1040 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater [09:52:45] RECOVERY - swift-account-replicator on ms-be1040 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [09:52:55] PROBLEM - puppet last run on druid1003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/rsyslog.d/10-puppet-agent.conf] [09:53:06] no, that's not the package name, it's the soname of the library which isn't necessary identical. 
for src:exiv2 it is, but it's not a reliable heuristic [09:53:06] RECOVERY - swift-object-server on ms-be1040 is OK: PROCS OK: 101 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [09:53:06] (03PS2) 10Ema: varnish: improve wm_common_directors_init readability [puppet] - 10https://gerrit.wikimedia.org/r/444827 [09:53:15] RECOVERY - swift-account-server on ms-be1040 is OK: PROCS OK: 49 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [09:53:15] RECOVERY - swift-container-server on ms-be1040 is OK: PROCS OK: 49 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [09:53:16] RECOVERY - swift-account-auditor on ms-be1040 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [09:53:16] RECOVERY - swift-object-auditor on ms-be1040 is OK: PROCS OK: 3 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [09:53:26] RECOVERY - swift-container-updater on ms-be1040 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater [09:53:35] RECOVERY - swift-container-replicator on ms-be1040 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [09:53:36] RECOVERY - swift-object-replicator on ms-be1040 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [09:56:46] PROBLEM - puppet last run on druid1005 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. 
Failed resources (up to 3 shown): File[/etc/rsyslog.d/10-puppet-agent.conf] [10:00:56] moritzm: sure, my thought was: maybe adding that heuristic in addition helps to reduce the list of manual hints required, not that will solve all of them [10:02:55] RECOVERY - puppet last run on druid1003 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [10:03:47] !log restart analytics100[1,2]'s hadoop resource managers, some I/O socket errors after the ip6 interface change [10:03:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:03:59] wouldn't really help, we really want reliable data here. ideally a future dpkg will provide the meta data, but until then it's an okay interim solution [10:06:45] RECOVERY - puppet last run on druid1005 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:07:18] (03PS3) 10Ema: varnish: improve directors definition readability [puppet] - 10https://gerrit.wikimedia.org/r/444827 [10:07:20] !log restarting thumbor on thumbor1001 to pick up exiv2 security updates [10:07:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:07:46] !log ppchelko@deploy1001 Started deploy [restbase/deploy@d724ad1]: Fix up the invalid Vary header [10:07:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:09:35] ACKNOWLEDGEMENT - Memory correctable errors -EDAC- on cp1053 is CRITICAL: 225 ge 4 Ema The host is depooled -- T165252 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=cp1053&var-datasource=eqiad%2520prometheus%252Fops [10:15:23] (03CR) 10Ema: [C: 031] Remove cp3048 prod DNS entries [dns] - 10https://gerrit.wikimedia.org/r/444560 (https://phabricator.wikimedia.org/T190607) (owner: 10Muehlenhoff) [10:20:36] !log ppchelko@deploy1001 deploy aborted: Fix up the invalid Vary header (duration: 12m 50s) [10:20:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[10:20:51] !log ppchelko@deploy1001 Started deploy [restbase/deploy@d724ad1]: Fix up the invalid Vary header, take 2, checer timed out [10:20:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:22:27] The main VisualEditor workboard is 404ing: https://phabricator.wikimedia.org/project/board/483/ [10:22:38] our other boards appear to be fine, and it's appearing in search [10:24:30] (03PS1) 10Mobrovac: RESTBase: Add Proton's URI [puppet] - 10https://gerrit.wikimedia.org/r/444835 (https://phabricator.wikimedia.org/T186748) [10:27:30] !log ppchelko@deploy1001 Finished deploy [restbase/deploy@d724ad1]: Fix up the invalid Vary header, take 2, checer timed out (duration: 06m 39s) [10:27:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:28:05] (03CR) 10Mobrovac: "PCC ok - https://puppet-compiler.wmflabs.org/compiler02/11752/" [puppet] - 10https://gerrit.wikimedia.org/r/444835 (https://phabricator.wikimedia.org/T186748) (owner: 10Mobrovac) [10:36:33] (03PS1) 10Arturo Borrero Gonzalez: install_server: partman: refresh labvirt-ssd partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/444837 [10:37:15] (03PS2) 10Arturo Borrero Gonzalez: install_server: partman: refresh labvirt-ssd partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/444837 [10:38:14] (03CR) 10Giuseppe Lavagetto: [C: 031] varnish: improve directors definition readability [puppet] - 10https://gerrit.wikimedia.org/r/444827 (owner: 10Ema) [10:41:56] (03CR) 10Ema: [C: 032] varnish: improve directors definition readability [puppet] - 10https://gerrit.wikimedia.org/r/444827 (owner: 10Ema) [10:42:15] (03PS1) 10ArielGlenn: make stub dumps on a couple more wikis ordered by revisions within pages [puppet] - 10https://gerrit.wikimedia.org/r/444838 (https://phabricator.wikimedia.org/T29112) [10:50:57] PROBLEM - High CPU load on API appserver on mw1232 is CRITICAL: CRITICAL - load average: 56.24, 33.31, 22.32 [10:51:18] (03PS2) 10ArielGlenn: make 
stub dumps on a couple more wikis ordered by revisions within pages [puppet] - 10https://gerrit.wikimedia.org/r/444838 (https://phabricator.wikimedia.org/T29112) [10:52:37] PROBLEM - High CPU load on API appserver on mw1282 is CRITICAL: CRITICAL - load average: 65.83, 36.97, 23.31 [10:52:37] (03CR) 10ArielGlenn: [C: 032] make stub dumps on a couple more wikis ordered by revisions within pages [puppet] - 10https://gerrit.wikimedia.org/r/444838 (https://phabricator.wikimedia.org/T29112) (owner: 10ArielGlenn) [10:53:04] (03PS1) 10Giuseppe Lavagetto: apache-fast-test: exit with non-zero exit codes if some issues are found [puppet] - 10https://gerrit.wikimedia.org/r/444839 [10:53:23] (03PS2) 10Muehlenhoff: Remove cp3048 prod DNS entries [dns] - 10https://gerrit.wikimedia.org/r/444560 (https://phabricator.wikimedia.org/T190607) [10:55:09] (03CR) 10Muehlenhoff: [C: 032] Remove cp3048 prod DNS entries [dns] - 10https://gerrit.wikimedia.org/r/444560 (https://phabricator.wikimedia.org/T190607) (owner: 10Muehlenhoff) [10:56:14] 10Operations, 10ops-esams, 10Traffic, 10Patch-For-Review: cp3048 hardware issues - https://phabricator.wikimedia.org/T190607 (10MoritzMuehlenhoff) [10:58:15] PROBLEM - High CPU load on API appserver on mw1226 is CRITICAL: CRITICAL - load average: 49.00, 33.59, 23.83 [10:58:30] 10Operations, 10Analytics, 10hardware-requests: eqiad: (2) hardware refresh for analytics1003 - https://phabricator.wikimedia.org/T198685 (10MoritzMuehlenhoff) p:05Triage>03Normal [10:58:36] PROBLEM - High CPU load on API appserver on mw1276 is CRITICAL: CRITICAL - load average: 61.06, 41.88, 27.14 [10:59:08] 10Operations, 10Traffic: Setup wikimediafoundation.org domain for July 30 launch of new site - https://phabricator.wikimedia.org/T198922 (10MoritzMuehlenhoff) [10:59:21] 10Operations, 10ops-esams: Degraded RAID on cp3048 - https://phabricator.wikimedia.org/T198784 (10MoritzMuehlenhoff) p:05Triage>03Normal [10:59:46] 10Operations, 10procurement: leases expiring on labvirt1010 
and 1011 - https://phabricator.wikimedia.org/T198762 (10MoritzMuehlenhoff) p:05Triage>03Normal [10:59:56] (03PS3) 10Arturo Borrero Gonzalez: openstack: bootstrap: neutron: refresh and add more hints [puppet] - 10https://gerrit.wikimedia.org/r/444222 (https://phabricator.wikimedia.org/T196633) [11:00:04] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: My dear minions, it's time we take the moon! Just kidding. Time for European Mid-day SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180710T1100). [11:00:04] dcausse and Aaron Schulz: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:40] dcausse, AaronSchulz: want to deploy your own patches? [11:01:26] o/ [11:01:31] zeljkof: sure [11:01:39] I suppose [11:02:02] dcausse, AaronSchulz: swat is yours then :) [11:02:13] self-organize and go ahead [11:02:26] ok starting to deploy mine (it should be quick, it's just a cleanup) [11:02:53] I am around if needed [11:02:55] (03CR) 10DCausse: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444576 (owner: 10DCausse) [11:03:27] 10Operations, 10ops-eqiad: Relabel labvirt1021.eqiad.wmnet as cloudvirt1021.eqiad.wmnet - https://phabricator.wikimedia.org/T199132 (10aborrero) The server was reimaged+renamed and I just upgraded the racktables record. [11:03:36] PROBLEM - SSH on ms-be1034 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:03:43] 10Operations, 10media-storage, 10User-fgiunchedi: Some swift filesystems reporting negative disk usage - https://phabricator.wikimedia.org/T199198 (10fgiunchedi) Other differences with existing machines include the fact that the last batch has been installed with stretch from the get go as opposed to jessie,... 
[11:04:26] RECOVERY - SSH on ms-be1034 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u2 (protocol 2.0) [11:04:39] (03Merged) 10jenkins-bot: [cirrus] cleanup unused config vars 1/2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444576 (owner: 10DCausse) [11:05:54] (03PS1) 10Ema: varnish: avoid adding vtc_backend multiple times [puppet] - 10https://gerrit.wikimedia.org/r/444840 [11:06:08] (03PS2) 10Giuseppe Lavagetto: apache-fast-test: exit with non-zero exit codes if some issues are found [puppet] - 10https://gerrit.wikimedia.org/r/444839 [11:06:13] (03PS1) 10ArielGlenn: one more wiki with order by revs for stubs dump [puppet] - 10https://gerrit.wikimedia.org/r/444841 (https://phabricator.wikimedia.org/T29112) [11:07:30] (03CR) 10ArielGlenn: [C: 032] one more wiki with order by revs for stubs dump [puppet] - 10https://gerrit.wikimedia.org/r/444841 (https://phabricator.wikimedia.org/T29112) (owner: 10ArielGlenn) [11:07:42] 10Operations, 10Traffic, 10Wikidata, 10wikiba.se, 10Patch-For-Review: [Task] move wikiba.se webhosting to wikimedia misc-cluster - https://phabricator.wikimedia.org/T99531 (10abian) wikiba.se is a bit unstable. Today has been down for some hours (from ~1:00 UTC to ~5:30 UTC). Last issues were detected on... 
[11:08:13] (03CR) 10jenkins-bot: [cirrus] cleanup unused config vars 1/2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444576 (owner: 10DCausse) [11:09:26] !log dcausse@deploy1001 Synchronized ./wmf-config/: [cirrus] cleanup unused config vars 1/2 (duration: 00m 53s) [11:09:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:10:38] (03CR) 10DCausse: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444577 (owner: 10DCausse) [11:11:23] (03CR) 10Ema: [C: 032] "pcc output looks good: https://puppet-compiler.wmflabs.org/compiler02/11753/" [puppet] - 10https://gerrit.wikimedia.org/r/444840 (owner: 10Ema) [11:11:31] (03PS2) 10Ema: varnish: avoid adding vtc_backend multiple times [puppet] - 10https://gerrit.wikimedia.org/r/444840 [11:11:46] (03PS1) 10ArielGlenn: fix typo [puppet] - 10https://gerrit.wikimedia.org/r/444850 [11:12:36] RECOVERY - High CPU load on API appserver on mw1226 is OK: OK - load average: 8.22, 18.40, 23.95 [11:13:21] 10Operations, 10ops-eqiad: Relabel labvirt1022.eqiad.wmnet as cloudvirt1022.eqiad.wmnet - https://phabricator.wikimedia.org/T199203 (10aborrero) [11:16:25] hashar: should I wait or rebase (https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/444577/) I don't get the status of this patch here [11:16:50] (03PS14) 10Ema: cache_text: add support for alternate_domains [puppet] - 10https://gerrit.wikimedia.org/r/443906 (https://phabricator.wikimedia.org/T164609) [11:16:52] (03PS14) 10Ema: cache_text: add misc directors and alternate_domains [puppet] - 10https://gerrit.wikimedia.org/r/443907 (https://phabricator.wikimedia.org/T164609) [11:16:54] (03PS9) 10Ema: cache_text: load misc VCL as wikimedia_misc in VTC files [puppet] - 10https://gerrit.wikimedia.org/r/443930 (https://phabricator.wikimedia.org/T164609) [11:16:56] (03PS7) 10Ema: cache_text: add misc-specific VTC tests [puppet] - 10https://gerrit.wikimedia.org/r/443974 (https://phabricator.wikimedia.org/T164609) [11:17:25] 
(03CR) 10TerraCodes: "Since I3773876fa7aa9205a5ea98cbbbdecaef9c06ff81 is deployed, should this patch be abandoned?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420948 (owner: 10Niharika29) [11:18:40] (03PS3) 10DCausse: [cirrus] cleanup unused config vars 2/2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444577 [11:18:43] hashar: with the new gerrit ui it's not clear that it needs a rebase (with the old one I see "Cannot merge") [11:19:20] (03CR) 10DCausse: [cirrus] cleanup unused config vars 2/2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444577 (owner: 10DCausse) [11:20:43] (03CR) 10DCausse: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444577 (owner: 10DCausse) [11:21:25] RECOVERY - High CPU load on API appserver on mw1232 is OK: OK - load average: 13.71, 16.06, 23.76 [11:21:55] RECOVERY - High CPU load on API appserver on mw1276 is OK: OK - load average: 14.06, 21.31, 29.61 [11:22:27] (03Merged) 10jenkins-bot: [cirrus] cleanup unused config vars 2/2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444577 (owner: 10DCausse) [11:22:39] (03CR) 10jenkins-bot: [cirrus] cleanup unused config vars 2/2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444577 (owner: 10DCausse) [11:24:05] (03PS3) 10Aaron Schulz: Make all non-test wikis write to both nutcracker and mcrouter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440470 (https://phabricator.wikimedia.org/T198239) [11:24:57] !log installing PHP 7 security updates on stretch [11:24:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:25:06] (03PS1) 10Arturo Borrero Gonzalez: openstack: eqiad1: enable nova-compute service [puppet] - 10https://gerrit.wikimedia.org/r/444855 (https://phabricator.wikimedia.org/T196633) [11:25:59] !log dcausse@deploy1001 Synchronized ./wmf-config/InitialiseSettings.php: [cirrus] cleanup unused config vars 2/2 (duration: 01m 40s) [11:26:01] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:26:31] AaronSchulz: I'm done, please go ahead [11:26:49] ok [11:27:00] (03CR) 10Aaron Schulz: [C: 032] Make all non-test wikis write to both nutcracker and mcrouter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440470 (https://phabricator.wikimedia.org/T198239) (owner: 10Aaron Schulz) [11:28:35] (03Merged) 10jenkins-bot: Make all non-test wikis write to both nutcracker and mcrouter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440470 (https://phabricator.wikimedia.org/T198239) (owner: 10Aaron Schulz) [11:32:14] (03CR) 10jenkins-bot: Make all non-test wikis write to both nutcracker and mcrouter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440470 (https://phabricator.wikimedia.org/T198239) (owner: 10Aaron Schulz) [11:32:50] !log aaron@deploy1001 Synchronized wmf-config/mc.php: Make all non-test wikis write to both nutcracker and mcrouter (duration: 00m 51s) [11:32:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:36:25] Since a few minutes I am gettings Fatals Error "Typs „Exception“ on Wikimedia Commons while saving [11:36:27] [W0SZmgpAAD4AAFHXWLcAAACR] 2018-07-10 11:33:46: Fataler Ausnahmefehler des Typs „Exception“ [11:38:35] RECOVERY - High CPU load on API appserver on mw1282 is OK: OK - load average: 13.75, 15.22, 29.89 [11:39:56] PROBLEM - DPKG on mw2250 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [11:40:05] hashar: https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/SpamBlacklist/+/444856/ [11:42:15] RECOVERY - DPKG on mw2250 is OK: All packages OK [11:43:06] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 90.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [11:43:35] !log updated compiler facts `PUPPET_MASTERS=puppetmaster1001.eqiad.wmnet PUPPET_COMPILER=compiler02.puppet3-diffs.eqiad.wmflabs 
modules/puppet_compiler/files/compiler-update-facts` [11:43:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:49:20] (03PS2) 10Arturo Borrero Gonzalez: openstack: eqiad1: enable nova-compute service [puppet] - 10https://gerrit.wikimedia.org/r/444855 (https://phabricator.wikimedia.org/T196633) [11:50:30] zeljkof: happy training today :D [11:50:54] addshore: thanks :P [11:51:01] enjoying it [11:53:16] !log ppchelko@deploy1001 Started deploy [restbase/deploy@2674971]: Roll out cassandra-driver@3.5.0 to restbase2001 [11:53:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:53:20] AaronSchulz: is that change going to be backported or the other patch reverted? [11:53:55] PROBLEM - Memcached on labtestweb2001 is CRITICAL: connect to address 208.80.153.14 and port 11000: Connection refused [11:54:13] addshore: backported [11:54:19] ack :) [11:54:38] addshore: want to CR for master? [11:55:18] can do [11:56:09] !log ppchelko@deploy1001 Finished deploy [restbase/deploy@2674971]: Roll out cassandra-driver@3.5.0 to restbase2001 (duration: 02m 52s) [11:56:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:58:05] Why did that suddenly break? 
[11:58:16] AaronSchulz: +2ed on master, also https://phabricator.wikimedia.org/T199216 if you want to use the bug number in SAL entries :) [11:58:24] (03PS3) 10Arturo Borrero Gonzalez: openstack: eqiad1: enable nova-compute service [puppet] - 10https://gerrit.wikimedia.org/r/444855 (https://phabricator.wikimedia.org/T196633) [11:59:11] 11:32 aaron@deploy1001: Synchronized wmf-config/mc.php: Make all non-test wikis write to both nutcracker and mcrouter (duration: 00m 51s) [11:59:13] Reedy: ^^ [11:59:24] lol [11:59:57] https://phabricator.wikimedia.org/T199039 [12:00:04] It's not the only on [12:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180710T1200) [12:00:06] one [12:00:10] Translate needs fixing it seems [12:00:18] addshore: that didn't used to be an exception back when, just a warning [12:00:50] might be better to just revert [12:00:58] (03CR) 10Arturo Borrero Gonzalez: [C: 032] "https://puppet-compiler.wmflabs.org/compiler02/11759/ compiler is good" [puppet] - 10https://gerrit.wikimedia.org/r/444855 (https://phabricator.wikimedia.org/T196633) (owner: 10Arturo Borrero Gonzalez) [12:01:19] addshore: maybe, waiting on jenkins is a tad slow [12:02:37] (03PS1) 10Aaron Schulz: Revert "Make all non-test wikis write to both nutcracker and mcrouter" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444862 [12:02:37] (03CR) 10Aaron Schulz: [C: 032] Revert "Make all non-test wikis write to both nutcracker and mcrouter" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444862 (owner: 10Aaron Schulz) [12:04:17] (03Merged) 10jenkins-bot: Revert "Make all non-test wikis write to both nutcracker and mcrouter" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444862 (owner: 10Aaron Schulz) [12:05:46] RECOVERY - Memcached on labtestweb2001 is OK: TCP OK - 0.036 second response time on 208.80.153.14 port 11000 [12:05:46] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on 
einsteinium is CRITICAL: cluster=cache_text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [12:05:50] AaronSchulz: addshore there's at least two other bugs like this already reported that haven't been fixed... [12:06:39] !log aaron@deploy1001 Synchronized wmf-config/mc.php: Revert "Make all non-test wikis write to both nutcracker and mcrouter" (duration: 00m 56s) [12:06:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:08:09] https://gerrit.wikimedia.org/r/444863 / https://phabricator.wikimedia.org/T199039 [12:08:15] And also https://phabricator.wikimedia.org/T199218 [12:08:54] (03CR) 10jenkins-bot: Revert "Make all non-test wikis write to both nutcracker and mcrouter" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444862 (owner: 10Aaron Schulz) [12:10:15] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [12:11:18] hmm, the RemoteFileDescription was already fixed in master it seems [12:11:25] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [12:20:36] !log aaron@deploy1001 Synchronized php-1.32.0-wmf.12/extensions/SpamBlacklist/includes/SpamBlacklist.php: 583dc7a92f9b (duration: 00m 51s) [12:20:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:22:01] 10Operations, 10MediaWiki-Configuration: Cleanup cirrus keys in $wmfSwiftEqiadConfig - https://phabricator.wikimedia.org/T199220 (10dcausse) [12:23:36] PROBLEM - puppet last run on mw1347 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Exec[set debconf flag seen for wireshark-common/install-setuid] [12:24:31] !log aaron@deploy1001 Synchronized php-1.32.0-wmf.10/extensions/SpamBlacklist/includes/SpamBlacklist.php: 08a2153f7aa (duration: 00m 51s) [12:24:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:24:48] 10Operations, 10MediaWiki-Configuration: Cleanup cirrus keys in $wmfSwiftEqiadConfig - https://phabricator.wikimedia.org/T199220 (10dcausse) p:05Triage>03Low [12:25:24] enough of that for now [12:25:46] * AaronSchulz wonders if something is up with makeKeyInternal() [12:27:42] Reedy: so the main problem is 61a7e1acd0af4a5386df03335733accfde179fa1 is not yet being in wmf10 [12:27:52] that commit seemed like ages ago [12:28:09] Haha [12:28:18] 13 days is nearly 2 weeks [12:28:34] the other key stuff is low priority cleanup [12:30:48] heh, the backport was abandoned...may as well restore and merge [12:31:49] though technically nobody should deploy anything now [12:32:09] zeljkof: are you doing the train later? [12:32:25] He's AFK for 30 [12:34:20] I'll just backport that one quickly [12:35:50] technically group0 wikis are bugged without it (though they are en sites with low traffic so the error rate is low) [12:35:57] AaronSchulz: yes [12:38:41] 10Operations, 10Pybal, 10Traffic, 10Patch-For-Review: Tune systemd journal rate limiting for PyBal - https://phabricator.wikimedia.org/T189290 (10Vgutierrez) IMHO it would be great if those messages were in stderr instead of stdout otherwise you miss them when using journalctl + grep, that's why we missed... [12:38:54] zeljkof: so nothing is happening during "MediaWiki train - Americas version" today right? 
[12:39:37] AaronSchulz: as far as I know, correct [12:39:52] hmm, perhaps I can retry the mc part then [12:41:23] (03PS1) 10Jcrespo: mariadb: Depool db1086 for reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444869 [12:41:43] AaronSchulz: I guess after backporting your change and merging it... Just check the logs for generic key errors first [12:46:38] (03PS1) 10Jcrespo: mariadb: Allow reimage of db108X servers [puppet] - 10https://gerrit.wikimedia.org/r/444870 [12:49:05] RECOVERY - puppet last run on mw1347 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [12:52:26] (03PS1) 10Vgutierrez: authdns: Replace baham with authdns2001 [puppet] - 10https://gerrit.wikimedia.org/r/444872 (https://phabricator.wikimedia.org/T196664) [12:53:22] !log aaron@deploy1001 Synchronized php-1.32.0-wmf.10/includes/libs/objectcache/ReplicatedBagOStuff.php: 4ad6b70ba132c66e14a706eae240887885946a42 (duration: 00m 51s) [12:53:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:59:55] (03PS2) 10Vgutierrez: authdns: Add authdns2001 to the list of authdns hosts [puppet] - 10https://gerrit.wikimedia.org/r/444872 (https://phabricator.wikimedia.org/T196664) [12:59:57] (03PS1) 10Vgutierrez: authdns: Remove baham from the authdns host list [puppet] - 10https://gerrit.wikimedia.org/r/444875 (https://phabricator.wikimedia.org/T196664) [13:00:04] zeljkof: #bothumor I ♥ Unicode. All rise for MediaWiki train - European version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180710T1300).
[13:01:07] (03CR) 10Vgutierrez: [C: 04-1] "To be merged after testing the syncing between the 4 dns servers and" [puppet] - 10https://gerrit.wikimedia.org/r/444875 (https://phabricator.wikimedia.org/T196664) (owner: 10Vgutierrez) [13:01:09] (03PS5) 10Muehlenhoff: Allow removing Diamond gradually [puppet] - 10https://gerrit.wikimedia.org/r/429389 (https://phabricator.wikimedia.org/T183454) [13:02:26] (03CR) 10Muehlenhoff: [C: 032] Allow removing Diamond gradually [puppet] - 10https://gerrit.wikimedia.org/r/429389 (https://phabricator.wikimedia.org/T183454) (owner: 10Muehlenhoff) [13:06:48] (03CR) 10Jcrespo: [C: 032] mariadb: Allow reimage of db108X servers [puppet] - 10https://gerrit.wikimedia.org/r/444870 (owner: 10Jcrespo) [13:06:54] 10Operations, 10media-storage, 10User-fgiunchedi: Some swift filesystems reporting negative disk usage - https://phabricator.wikimedia.org/T199198 (10fgiunchedi) I have stopped swift/rsync on `ms-be1040` to inspect further, taking e.g. `sde` and running `xfs_repair -n` on it mentioned a discrepancy in free b... 
[13:06:56] (03PS2) 10Jcrespo: mariadb: Allow reimage of db108X servers [puppet] - 10https://gerrit.wikimedia.org/r/444870 [13:07:35] (03PS1) 10Andrew Bogott: nova: change to api manager override [puppet] - 10https://gerrit.wikimedia.org/r/444878 (https://phabricator.wikimedia.org/T198950) [13:09:06] (03PS2) 10Andrew Bogott: nova: change to api manager override [puppet] - 10https://gerrit.wikimedia.org/r/444878 (https://phabricator.wikimedia.org/T198950) [13:10:05] (03CR) 10Andrew Bogott: [C: 032] nova: change to api manager override [puppet] - 10https://gerrit.wikimedia.org/r/444878 (https://phabricator.wikimedia.org/T198950) (owner: 10Andrew Bogott) [13:10:31] (03CR) 10Thcipriani: [C: 032] Scap clean: remove remote cache directory [mediawiki-config] - 10https://gerrit.wikimedia.org/r/441920 (https://phabricator.wikimedia.org/T157030) (owner: 10Thcipriani) [13:12:16] (03Merged) 10jenkins-bot: Scap clean: remove remote cache directory [mediawiki-config] - 10https://gerrit.wikimedia.org/r/441920 (https://phabricator.wikimedia.org/T157030) (owner: 10Thcipriani) [13:12:41] (03CR) 10jenkins-bot: Scap clean: remove remote cache directory [mediawiki-config] - 10https://gerrit.wikimedia.org/r/441920 (https://phabricator.wikimedia.org/T157030) (owner: 10Thcipriani) [13:12:43] (03PS2) 10Daniel Kinzler: wgMultiContentRevisionSchemaMigrationStage MIGRATION_OLD [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440128 (https://phabricator.wikimedia.org/T174044) (owner: 10Addshore) [13:14:59] !log thcipriani@deploy1001 Synchronized scap/plugins/clean.py: no op sync for consistancy [[gerrit:441920|Scap clean: remove remote cache directory]] (duration: 00m 51s) [13:15:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:20] (03PS1) 10Nehajha: Providing users more clue when kuberenetes is unable to delete all the objects [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/444879 (https://phabricator.wikimedia.org/T140415) [13:17:40] 
10Operations, 10media-storage, 10User-fgiunchedi: Some swift filesystems reporting negative disk usage - https://phabricator.wikimedia.org/T199198 (10fgiunchedi) Graphs of disk usage for the remaining affected hosts: {F23554534} {F23554533} {F23554532} [13:25:39] !log thcipriani@deploy1001 Pruned MediaWiki: 1.32.0-wmf.4 (duration: 06m 18s) [13:25:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:57] !log installing cups security updates on jessie [13:26:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:29:41] !log thcipriani@deploy1001 Pruned MediaWiki: 1.32.0-wmf.5 (duration: 03m 13s) [13:29:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:13] (03PS2) 10Daniel Kinzler: MCR DNM Enable MCR write-both mode on commons beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/442918 (https://phabricator.wikimedia.org/T197818) [13:30:15] zeljkof: ok, done cleaning for now, should be good, sorry to take up so much time [13:30:56] thcipriani: thanks for the help! [13:35:40] thcipriani, hashar: it might be obvious to some, but not to me :) I'm here https://wikitech.wikimedia.org/wiki/Heterogeneous_deployment/Train_deploys#Create_patches_to_update_wikiversions.json [13:36:03] and I'm not sure from where should I run `scap update-wikiversions group0 VERSION ` [13:36:29] I guess from deploy1001 machine, but which folder? 
[13:36:42] in deploy1001:/srv/mediawiki-staging [13:36:43] /srv/mediawiki-staging [13:37:28] ah, thanks, just saw it in the next step :) will update the docs [13:38:27] thcipriani, hashar: it would be great if you could take a look at the docs a few times today, I have been updating them as I go along, to make sure I did not misunderstood something and broke the docs :) [13:38:38] sure :) [13:38:57] thanks [13:39:56] (03CR) 10Zhuyifei1999: Providing users more clue when kuberenetes is unable to delete all the objects (031 comment) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/444879 (https://phabricator.wikimedia.org/T140415) (owner: 10Nehajha) [13:40:34] !log rolling restart of thumbor to pick up new exiv2/openssl [13:40:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:26] (03PS2) 10Nehajha: Providing users more clue when kuberenetes is unable to delete all the objects [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/444879 (https://phabricator.wikimedia.org/T140415) [13:45:54] ok, hopefully this is ok https://wikitech.wikimedia.org/wiki/Heterogeneous_deployment/Train_deploys#Create_patches_to_update_wikiversions.json [13:45:57] doing it [13:48:28] !log installing subversion updates from jessie 9.11 point release [13:48:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:49:15] (03CR) 10Aftab: [C: 031] "ok" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444749 (https://phabricator.wikimedia.org/T199161) (owner: 10Urbanecm) [13:52:23] (03PS1) 10Andrew Bogott: nova: allow neutron hosts to access metadata service [puppet] - 10https://gerrit.wikimedia.org/r/444886 [13:53:00] !log installing ruby-sprockets security updates [13:53:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:15] (03CR) 10jerkins-bot: [V: 04-1] nova: allow neutron hosts to access metadata service [puppet] - 10https://gerrit.wikimedia.org/r/444886 (owner: 
10Andrew Bogott) [13:53:27] (03PS1) 10Zfilipin: Group0 to 1.32.0-wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444887 [13:59:58] (03CR) 10Thcipriani: [C: 031] Group0 to 1.32.0-wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444887 (owner: 10Zfilipin) [14:01:55] (03PS2) 10Andrew Bogott: nova: allow neutron hosts to access metadata service [puppet] - 10https://gerrit.wikimedia.org/r/444886 [14:02:00] zeljkof: I went hunting a chromium bug this afternoon sorry :D [14:02:10] thcipriani: ok, did everything up to `scap clean` [14:02:48] since you did it, I can skip it, right? for future reference, it also runs from `you@deploy1001:/srv/mediawiki-staging`? [14:03:00] 10Operations, 10CommRel-Internals, 10Wikimedia-Mailing-lists: Rename (create anew) the TC team mailing list - https://phabricator.wikimedia.org/T155683 (10Qgil) a:03Qgil At the end we have decided to stop using this list. [14:03:06] hashar: bad time for bug hunting! ;) [14:03:15] 10Operations, 10CommRel-Internals, 10Wikimedia-Mailing-lists: Rename (create anew) the TC team mailing list - https://phabricator.wikimedia.org/T155683 (10Qgil) [14:03:19] I _think_ things are fine so far [14:03:35] nobody is screaming at me. yet [14:03:41] zeljkof: yeah, it runs from /srv/mediawiki-staging I did some cleanup, still more to do, but it should be fine to skip for now [14:03:59] thcipriani: ok, updating docs and skipping this time [14:04:05] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Wikidata-Query-Service-Sprint: Define an SLO for Wikidata Query Service public endpoint and communicate it - https://phabricator.wikimedia.org/T199228 (10Gehel) [14:04:53] hey, I have couple NO-OP config changes to deploy - it's removeing the unused config variables - what is the protocol here? [14:05:25] SWAT? :) [14:05:45] do I need to do that during SWAT windows? [14:05:59] You dont have to. 
But that's probably the easiest [14:06:03] example: https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/444591/ [14:06:29] Do you have deploy access? [14:06:31] (03CR) 10Andrew Bogott: [C: 032] nova: allow neutron hosts to access metadata service [puppet] - 10https://gerrit.wikimedia.org/r/444886 (owner: 10Andrew Bogott) [14:06:32] yes [14:06:37] raynor: please don't do it now, since I'm lost in the middle of my very first train :) [14:06:53] If you've got quite a few cleanup patches to deploy, you can find a deploy window yourself [14:07:24] I can do that during swat, but because those are noop I didn't want to block the SWAT window [14:08:07] zeljkof: don't worry, now I'm just asking for info when should I do that [14:08:40] raynor: any swat should do, even taking the entire window should be fine [14:08:49] maybe I can do everything like after a SWAT window, lets say tomorrow mid-day [14:08:52] or just pick a time with no deployments [14:09:30] today is a train day, I'll do that tomorrow before mid-day SWAT [14:09:35] raynor: just make sure greg-g knows about it, he is probably go-to person [14:09:53] ok, thx [14:15:08] !log installing tiff security updates on jessie [14:15:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:45] !log zfilipin@deploy1001 Started scap: testwiki to php-1.32.0-wmf.12 and rebuild l10n cache [14:17:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:41] 10Operations, 10ops-codfw, 10DC-Ops: Replace disk on wasat - https://phabricator.wikimedia.org/T197562 (10ArielGlenn) /dev/sda is just not showing up; that should be the disk that was replaced in T193394 ``` root@wasat:/var/log# ls -l /dev/disk/by-id total 0 lrwxrwxrwx 1 root root 9 Mar 6 08:39 ata-MM0500... 
[14:22:09] 10Operations, 10ops-eqiad, 10DC-Ops, 10Dumps-Generation, 10Patch-For-Review: snapshot1005 does not power back up - https://phabricator.wikimedia.org/T198792 (10Cmjohnson) I attempted to power off, unplug and power the server back on, unfortunately it does not want to power on...i just get a flashing gre... [14:23:39] hashar, thcipriani: is this stuck or just taking a long time, it's 5 minutes so far... `14:18:06 Updating LocalisationCache for 1.32.0-wmf.10 using 30 thread(s)` [14:24:00] 10Operations, 10ops-eqiad, 10DC-Ops, 10Dumps-Generation, 10Patch-For-Review: snapshot1005 does not power back up - https://phabricator.wikimedia.org/T198792 (10ArielGlenn) The service was moved to another host already (which had been the spare) but we do still want a spare. February is not that far away... [14:24:17] zeljkof: you're going to be staring at that for a little while yet, that's the most time-intensive part [14:24:55] thcipriani: ah, time for a quick break then :) should I use `screen` for that one, since it's lost? [14:25:01] (for future reference) [14:25:15] (03CR) 10Gehel: "Looks reasonable to me. One not entirely minor point is the communication from analytics cluster, which does not go through LVS, but talks" [puppet] - 10https://gerrit.wikimedia.org/r/444610 (owner: 10EBernhardson) [14:25:15] I think on the new machine it's been taking like 20 minutes or so [14:25:18] "long", not "lost" [14:25:36] zeljkof: yeah, I tend to do the entirety of train in a tmux sessions [14:25:46] s/s$// [14:26:07] thcipriani: hm, maybe we should make it explicit in the docs... 
[14:27:20] sure probably a good idea: start a screen or tmux session at the start of the window, do all this there to make your life easier
[14:31:12] !log Set disk #0 offline for replacement - T199056
[14:31:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:31:19] T199056: db1069 bad disk - https://phabricator.wikimedia.org/T199056
[14:31:35] Operations, media-storage, User-fgiunchedi: Some swift filesystems reporting negative disk usage - https://phabricator.wikimedia.org/T199198 (fgiunchedi) I've posted to `linux-xfs` mailing list to ask if someone has run into this bug before and how to debug further. Regardless of whether we can succ...
[14:32:03] Operations, ops-eqiad, DBA: db1069 bad disk - https://phabricator.wikimedia.org/T199056 (Marostegui) @Cmjohnson disk #0 is now offline, feel free to replace it when you can.
[14:34:22] Operations, ops-eqiad: Relabel labvirt1022.eqiad.wmnet as cloudvirt1022.eqiad.wmnet - https://phabricator.wikimedia.org/T199203 (Cmjohnson) Open>Resolved
[14:35:24] thcipriani: any advice for us mere mortals that don't use screen/tmux much? just `screen` or `tmux` at the start, appropriate command at the end? any command line flags recommended?
[14:35:29] Operations, ops-eqiad: Relabel labvirt1021.eqiad.wmnet as cloudvirt1021.eqiad.wmnet - https://phabricator.wikimedia.org/T199132 (Cmjohnson) Open>Resolved
[14:37:04] I use: tmux new -s 'train'; from there I do my work in the window. When I'm done I `exit` until I'm off the server. If you need to leave in the middle you can do: ctrl-b d to detach
[14:38:43] thcipriani: sounds good to me! I'll add it to the docs :)
[14:39:32] for screen, I guess I'd do: screen -D -RR train; and ctrl-a d to detach.
I don't remember why I do -D -RR, but that's the flags I've memorized :)
[14:39:47] Operations, ops-eqiad, DBA: db1069 bad disk - https://phabricator.wikimedia.org/T199056 (Marostegui) Disk replaced by Chris, let's see if this time it turns out fine! ``` root@db1069:~# megacli -PDRbld -ShowProg -PhysDrv [32:0] -a0 Rebuild Progress on Device at Enclosure 32, Slot 0 Completed 1% in 1...
[14:46:07] thcipriani: I'll add it to the docs, I'm sure it will get fixed soon if it's not the correct way to do it :)
[14:50:45] RECOVERY - Device not healthy -SMART- on db1069 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db1069&var-datasource=eqiad%2520prometheus%252Fops
[14:54:36] PROBLEM - MegaRAID on db1069 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded)
[14:54:36] ACKNOWLEDGEMENT - MegaRAID on db1069 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T199232
[14:54:52] Operations, ops-eqiad: Degraded RAID on db1069 - https://phabricator.wikimedia.org/T199232 (ops-monitoring-bot)
[14:57:27] Operations, ops-eqiad: Degraded RAID on db1069 - https://phabricator.wikimedia.org/T199232 (Marostegui)
[14:57:29] Operations, ops-eqiad, DBA: db1069 bad disk - https://phabricator.wikimedia.org/T199056 (Marostegui)
[15:00:55] thcipriani: looks good? https://wikitech.wikimedia.org/wiki/Heterogeneous_deployment/Train_deploys#tmux_or_screen
[15:01:08] * thcipriani looks
[15:01:30] bd
[15:01:45] up to `sync-apaches: 0% (ok: 0; fail: 0; left: 264)`
[15:01:49] but stuck there :/
[15:01:58] How long for?
[15:04:28] Reedy: not stuck any more, up to 50%
[15:04:37] Did it report any errors?
[15:04:41] thcipriani: bd?
[15:04:57] Reedy: no so far, as far as I can see
[15:05:02] heh, bd <- two thumbs up :)
[15:05:09] thcipriani: :D
[15:09:15] Operations, ops-codfw: Degraded RAID on wasat - https://phabricator.wikimedia.org/T193394 (RobH) @volans: Are you handling the reimagine? This host is still email spamming about the defunct disk. There is also related/duplicate task T197562,
[15:09:40] Operations, ops-codfw, DC-Ops: Replace disk on wasat - https://phabricator.wikimedia.org/T197562 (RobH) Open>Invalid This is indeed a dupe of T193394 which has some info. Closing this as a dupe.
[15:10:37] PROBLEM - High CPU load on API appserver on mw1233 is CRITICAL: CRITICAL - load average: 49.74, 25.55, 18.47
[15:10:57] Operations, ops-codfw: Degraded RAID on wasat - https://phabricator.wikimedia.org/T193394 (Volans) @RobH No, not in my plate, I was told it was about to be reimaged, I think that @MoritzMuehlenhoff and @elukey might have more info about this.
[15:12:47] RECOVERY - High CPU load on API appserver on mw1233 is OK: OK - load average: 24.22, 24.51, 19.07
[15:13:11] Operations, Wikimedia-Mailing-lists: Create dedicated mailing list for schema changes, API changes, and other things affecting tool maintainers - https://phabricator.wikimedia.org/T199234 (MusikAnimal)
[15:15:56] thcipriani, hashar: I have just noticed that train window is officially over, but I'm still at https://wikitech.wikimedia.org/wiki/Heterogeneous_deployment/Train_deploys#Sync_to_cluster_and_verify_on_testwiki
[15:16:15] Keep going :)
[15:16:33] this is running for the last hour or so :/ `zfilipin@deploy1001:/srv/mediawiki-staging$ scap sync "testwiki to php-1.32.0-wmf.12 and rebuild l10n cache"`
[15:17:04] the first time `14:17:45 Started scap: testwiki to php-1.32.0-wmf.12 and rebuild l10n cache`
[15:17:23] the last, and still not over `15:08:11 Started scap-cdb-rebuild`
[15:18:02] That's gonna take a while
[15:18:42] yeah, don't stop, the next thing is puppetswat with nothing
scheduled
[15:19:13] !log zfilipin@deploy1001 Finished scap: testwiki to php-1.32.0-wmf.12 and rebuild l10n cache (duration: 61m 27s)
[15:19:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:19:23] oh, this step is finally done :)
[15:19:28] heh
[15:19:53] (CR) Filippo Giunchedi: [C: 1] prometheus: tools: scrape paws metrics into prometheus [puppet] - https://gerrit.wikimedia.org/r/441514 (https://phabricator.wikimedia.org/T195030) (owner: Chico Venancio)
[15:27:29] (PS15) Ema: cache_text: add support for alternate_domains [puppet] - https://gerrit.wikimedia.org/r/443906 (https://phabricator.wikimedia.org/T164609)
[15:28:28] (CR) Ema: [C: 2] cache_text: add support for alternate_domains [puppet] - https://gerrit.wikimedia.org/r/443906 (https://phabricator.wikimedia.org/T164609) (owner: Ema)
[15:28:53] (PS15) Ema: cache_text: add misc directors and alternate_domains [puppet] - https://gerrit.wikimedia.org/r/443907 (https://phabricator.wikimedia.org/T164609)
[15:29:40] (CR) Ema: [C: 2] cache_text: add misc directors and alternate_domains [puppet] - https://gerrit.wikimedia.org/r/443907 (https://phabricator.wikimedia.org/T164609) (owner: Ema)
[15:33:54] thcipriani, hashar, greg-g, Reedy: stupid question, but how do I check l10n cache? https://wikitech.wikimedia.org/wiki/Heterogeneous_deployment/Train_deploys#Sync_to_cluster_and_verify_on_testwiki
[15:33:59] Operations, monitoring, User-fgiunchedi: Handle SMART for multiple shelves attached to a single smartarray controller - https://phabricator.wikimedia.org/T199236 (fgiunchedi) p:Triage>Normal
[15:34:04] zeljkof: Visit the wiki
[15:34:08] Check the messages are there
[15:34:09] ok
[15:34:14] And not
[15:34:26] ah, just testwiki should look ok, not broken :)
[15:34:35] Yeah
[15:34:38] Rather than like https://en.wikipedia.org/wiki/?uselang=qqx
[15:35:05] thanks!
:)
[15:35:16] (will update the docs, for the next clueless deployer)
[15:38:12] Operations, Analytics, Analytics-Kanban, netops, Patch-For-Review: Review analytics-in4/6 rules on cr1/cr2 eqiad - https://phabricator.wikimedia.org/T198623 (ayounsi)
[15:38:24] (PS3) Vgutierrez: [WIP] get rid of openssl CLI usage [software/certcentral] - https://gerrit.wikimedia.org/r/444631
[15:38:27] PROBLEM - eventstreams on scb2003 is CRITICAL: connect to address 10.192.0.33 and port 8092: Connection refused
[15:38:28] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) timed out before a response was received: /{domain}/v1/page/definition/{title}{/revision}{/tid} (retriev
[15:38:28] initions for cat) timed out before a response was received
[15:38:38] Operations, ops-eqiad, Traffic, Patch-For-Review: rack/setup/install lvs101[3-6] - https://phabricator.wikimedia.org/T184293 (Cmjohnson) lvs1015 idrac is setup, I think it's cabled correctly but I am not really sure, enp4s0f1 doesn't translate for me looking at h/w but I am pretty sure it matches...
[15:38:47] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/references/{title}{/revision} (Get references of a test page) timed out before a response was received
[15:39:18] Operations, Analytics, Analytics-Kanban, netops, Patch-For-Review: Review analytics-in4/6 rules on cr1/cr2 eqiad - https://phabricator.wikimedia.org/T198623 (ayounsi)
[15:39:27] (CR) jerkins-bot: [V: -1] [WIP] get rid of openssl CLI usage [software/certcentral] - https://gerrit.wikimedia.org/r/444631 (owner: Vgutierrez)
[15:39:28] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy
[15:39:28] RECOVERY - eventstreams on scb2003 is OK: HTTP OK: HTTP/1.1 200 OK - 1066 bytes in 0.089 second response time
[15:39:47] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy
[15:40:30] mobrovac: troubles with restbase?
[15:43:25] ah no, ignore ema
[15:44:07] Is that advice for life?
[15:44:15] 'ignore ema'
[15:44:29] it isn't bad advice really! :)
[15:44:57] RECOVERY - MegaRAID on db1069 is OK: OK: optimal, 1 logical, 2 physical, WriteBack policy
[15:46:33] Operations, ops-eqiad, Traffic, Patch-For-Review: rack/setup/install lvs101[3-6] - https://phabricator.wikimedia.org/T184293 (Vgutierrez) @Cmjohnson take into account that eth0 should be enp4s0f0, not enp4s0f1 :) BTW, would you mind checking the ethernet firmware version and update them if neede...
[15:48:29] friendly reminder: we are going to replace baham.w.o with authdns2001.w.o in a few minutes (16:00 UTC), please don't merge any operations/dns commits till we are done, thanks!
:)
[15:52:07] PROBLEM - Varnish frontend child restarted on cp3033 is CRITICAL: 16 gt 3 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3033&var-datasource=esams+prometheus/ops
[15:52:12] (PS1) WMDE-Fisch: Enable FileExporter for sourceswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/444901 (https://phabricator.wikimedia.org/T198594)
[15:53:06] ema ^^
[15:53:28] vgutierrez: yup, looking
[15:53:34] <3
[15:56:21] (PS4) Vgutierrez: [WIP] get rid of openssl CLI usage [software/certcentral] - https://gerrit.wikimedia.org/r/444631
[15:57:25] Operations, Analytics, Analytics-Kanban, netops, Patch-For-Review: Review analytics-in4/6 rules on cr1/cr2 eqiad - https://phabricator.wikimedia.org/T198623 (ayounsi) I added the IPv6 equivalent of the v4 filter with a default "log+permit" term, so we can see if we missed anything. 3 highlig...
[15:57:47] (CR) jerkins-bot: [V: -1] [WIP] get rid of openssl CLI usage [software/certcentral] - https://gerrit.wikimedia.org/r/444631 (owner: Vgutierrez)
[15:58:28] thcipriani, hashar, Reedy: um, so I put my mediawiki username/password in the file? https://wikitech.wikimedia.org/wiki/Heterogeneous_deployment/Train_deploys#Update_deploy_notes
[15:58:43] I have 2FA, would that even work?
[15:59:00] BotPasswords!
[15:59:02] I think...
[16:00:04] godog, moritzm, and _joe_: (Dis)respected human, time to deploy Puppet SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180710T1600). Please do the needful.
[16:00:04] No GERRIT patches in the queue for this window AFAICS.
[16:00:07] I've not run that script in ages
[16:02:43] (CR) Vgutierrez: [C: 2] authdns: Add authdns2001 to the list of authdns hosts [puppet] - https://gerrit.wikimedia.org/r/444872 (https://phabricator.wikimedia.org/T196664) (owner: Vgutierrez)
[16:02:54] (PS3) Vgutierrez: authdns: Add authdns2001 to the list of authdns hosts [puppet] - https://gerrit.wikimedia.org/r/444872 (https://phabricator.wikimedia.org/T196664)
[16:03:31] Reedy: thanks! looking for botpasswords
[16:03:40] !log replacing baham with authdns2001 - T196664
[16:03:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:03:43] T196664: rack/setup/install authdns2001.wikimedia.org - https://phabricator.wikimedia.org/T196664
[16:04:04] marxarelli: around for a little help with this? :) https://wikitech.wikimedia.org/wiki/Heterogeneous_deployment/Train_deploys#Update_deploy_notes
[16:04:53] zeljkof: so I wrote a replacement for update-deploy-notes that I've been using, but I know that others who do the train still use the php script
[16:05:05] it's in the tools-release repo
[16:05:36] makedeploynotes.py
[16:05:56] I don't have a preference, but I don't think my password would work, because 2FA
[16:06:07] this one doesn't upload for you :)
[16:06:09] makedeploynotes.py 1.32.0-wmf.10 1.32.0-wmf.12 | tee deploy-notes-1.32.0-wmf.12
[16:06:35] and then just copy and paste from that file to https://www.mediawiki.org/wiki/MediaWiki_1.32/wmf.12/Changelog
[16:06:36] thcipriani: that's cool, I can copy/paste like Bruce Lee! :)
[16:08:22] thcipriani: can I just update the docs to use that?
:) seems more sane and secure
[16:08:23] added bonus: also you don't need to update your local checkout since that one queries gitiles
[16:08:40] (PS6) Andrew Bogott: prometheus: tools: scrape paws metrics into prometheus [puppet] - https://gerrit.wikimedia.org/r/441514 (https://phabricator.wikimedia.org/T195030) (owner: Chico Venancio)
[16:08:47] PROBLEM - Check systemd state on cp1008 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[16:09:30] Operations, ops-codfw: Degraded RAID on wasat - https://phabricator.wikimedia.org/T193394 (ArielGlenn) So the new disk is showing up as /dev/sdc now, presumably a reimage would straighten everything out.
[16:09:48] zeljkof: sure, that's the one I've been using for a while. Not sure what marxarell.i and twentyafterfou.r do.
[16:10:07] PROBLEM - Check systemd state on cp3033 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[16:10:08] PROBLEM - Varnish HTTP text-frontend - port 3121 on cp3033 is CRITICAL: connect to address 10.20.0.168 and port 3121: Connection refused
[16:10:16] thcipriani: old docs will be in history, in case they want to go back :)
[16:10:17] PROBLEM - Varnish HTTP text-frontend - port 3127 on cp3033 is CRITICAL: connect to address 10.20.0.168 and port 3127: Connection refused
[16:10:32] thcipriani: ok, I'll do my best to update the docs, then ask you to review, sounds good?
[16:10:52] sure
[16:11:08] PROBLEM - Varnish HTTP text-frontend - port 3126 on cp3033 is CRITICAL: connect to address 10.20.0.168 and port 3126: Connection refused
[16:11:25] thcipriani: thanks!
sorry, it's been a long day and I am totally confused with the process
[16:11:27] PROBLEM - Varnish HTTP text-frontend - port 3120 on cp3033 is CRITICAL: connect to address 10.20.0.168 and port 3120: Connection refused
[16:11:27] PROBLEM - Varnish HTTP text-frontend - port 3124 on cp3033 is CRITICAL: connect to address 10.20.0.168 and port 3124: Connection refused
[16:11:27] PROBLEM - Varnish HTTP text-frontend - port 3125 on cp3033 is CRITICAL: connect to address 10.20.0.168 and port 3125: Connection refused
[16:11:45] this copy/paste your password thing really freaks me out
[16:12:25] cp3033 above is me ^
[16:12:28] PROBLEM - Varnish HTTP text-frontend - port 3123 on cp3033 is CRITICAL: connect to address 10.20.0.168 and port 3123: Connection refused
[16:12:37] PROBLEM - Varnish HTTP text-frontend - port 3122 on cp3033 is CRITICAL: connect to address 10.20.0.168 and port 3122: Connection refused
[16:12:47] PROBLEM - Varnish HTTP text-frontend - port 80 on cp3033 is CRITICAL: connect to address 10.20.0.168 and port 80: Connection refused
[16:13:03] thcipriani: so no need to clone core, right?
[16:13:17] for the python notes script?
[16:13:22] nope, can be run from anywhere
[16:13:30] thcipriani: yes, great, ok
[16:13:33] removing from docs
[16:13:39] and from my home folder :P
[16:14:03] zeljkof: and no worries about the questions. I am fully aware of the potential negative effects the train process can have on one's psyche.
[16:14:11] :)
[16:14:14] zeljkof: you're in good hands with thcipriani :)
[16:14:14] :D
[16:14:23] * marxarelli makes his morning coffee
[16:14:29] Operations, Analytics, Analytics-Kanban, netops, Patch-For-Review: Review analytics-in4/6 rules on cr1/cr2 eqiad - https://phabricator.wikimedia.org/T198623 (Addshore) The WMDE scripts have requests going to the following places not via the webproxy: - https://noc.wikimedia.org/conf/dblists/...
[16:14:43] (CR) Andrew Bogott: [C: 2] prometheus: tools: scrape paws metrics into prometheus [puppet] - https://gerrit.wikimedia.org/r/441514 (https://phabricator.wikimedia.org/T195030) (owner: Chico Venancio)
[16:15:04] I'll polish to docs so the next person will just have to copy/paste ;)
[16:15:06] (PS1) Ema: Revert "cache_text: add support for alternate_domains" [puppet] - https://gerrit.wikimedia.org/r/444903
[16:15:11] zeljkof: seriously though, i can help with the process this week as well
[16:16:18] (PS2) Ema: Revert "cache_text: add support for alternate_domains" [puppet] - https://gerrit.wikimedia.org/r/444903
[16:17:15] (CR) Ema: [C: 2] Revert "cache_text: add support for alternate_domains" [puppet] - https://gerrit.wikimedia.org/r/444903 (owner: Ema)
[16:20:28] RECOVERY - Varnish HTTP text-frontend - port 3126 on cp3033 is OK: HTTP OK: HTTP/1.1 200 OK - 502 bytes in 0.171 second response time
[16:20:47] RECOVERY - Varnish HTTP text-frontend - port 3120 on cp3033 is OK: HTTP OK: HTTP/1.1 200 OK - 502 bytes in 0.167 second response time
[16:20:47] RECOVERY - Varnish HTTP text-frontend - port 3124 on cp3033 is OK: HTTP OK: HTTP/1.1 200 OK - 502 bytes in 0.167 second response time
[16:20:47] RECOVERY - Varnish HTTP text-frontend - port 3125 on cp3033 is OK: HTTP OK: HTTP/1.1 200 OK - 502 bytes in 0.168 second response time
[16:20:48] RECOVERY - Varnish frontend child restarted on cp3033 is OK: (C)3 gt (W)1 gt 1 https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp3033&var-datasource=esams+prometheus/ops
[16:21:08] RECOVERY - Check systemd state on cp3033 is OK: OK - running: The system is fully operational
[16:21:48] RECOVERY - Varnish HTTP text-frontend - port 3123 on cp3033 is OK: HTTP OK: HTTP/1.1 200 OK - 501 bytes in 0.168 second response time
[16:21:57] RECOVERY - Varnish HTTP text-frontend - port 3122 on cp3033 is OK: HTTP OK: HTTP/1.1 200 OK - 502 bytes in 0.167
second response time
[16:22:07] RECOVERY - Varnish HTTP text-frontend - port 80 on cp3033 is OK: HTTP OK: HTTP/1.1 200 OK - 503 bytes in 0.167 second response time
[16:22:37] RECOVERY - Varnish HTTP text-frontend - port 3121 on cp3033 is OK: HTTP OK: HTTP/1.1 200 OK - 501 bytes in 0.167 second response time
[16:22:38] RECOVERY - Varnish HTTP text-frontend - port 3127 on cp3033 is OK: HTTP OK: HTTP/1.1 200 OK - 502 bytes in 0.167 second response time
[16:22:46] Operations, Analytics, Analytics-Kanban, netops, Patch-For-Review: Review analytics-in4/6 rules on cr1/cr2 eqiad - https://phabricator.wikimedia.org/T198623 (elukey) >>! In T198623#4412494, @Addshore wrote: > The WMDE scripts have requests going to the following places not via the webproxy: >...
[16:22:56] (CR) EBernhardson: [C: -1] "I took a look into the analytics cluster part, what we can do is utilize the HostHeaderSSLAdapter from requests_toolbelt to connect to the" [puppet] - https://gerrit.wikimedia.org/r/444610 (owner: EBernhardson)
[16:23:24] Operations, Analytics, Analytics-Kanban, netops, Patch-For-Review: Review analytics-in4/6 rules on cr1/cr2 eqiad - https://phabricator.wikimedia.org/T198623 (ayounsi) And more redundant, as query.wikidata.org and wikidata.org are load balanced.
[16:25:10] thcipriani: looks good?
https://wikitech.wikimedia.org/wiki/Heterogeneous_deployment/Train_deploys#Update_deploy_notes
[16:25:18] RECOVERY - Check systemd state on cp1008 is OK: OK - running: The system is fully operational
[16:25:26] thcipriani: diff https://wikitech.wikimedia.org/w/index.php?title=Heterogeneous_deployment%2FTrain_deploys&type=revision&diff=1796714&oldid=1796711
[16:26:49] zeljkof: lgtm, although I've been running this from my laptop
[16:27:03] thcipriani: ah
[16:27:40] hm, for simpler docs, let's just assume everything is at deployment server
[16:27:47] and then people might do as they want
[16:27:52] I might make a note
[16:28:10] good point, no need for this to be there
[16:28:23] (PS1) Pmiazga: Enable page previews for all new editors [mediawiki-config] - https://gerrit.wikimedia.org/r/444906 (https://phabricator.wikimedia.org/T197719)
[16:28:40] (CR) Vgutierrez: [C: 2] "baham depooled successfully :)" [puppet] - https://gerrit.wikimedia.org/r/444875 (https://phabricator.wikimedia.org/T196664) (owner: Vgutierrez)
[16:28:47] thcipriani: also, `on Windows _netrc` makes no sense and can be removed, right?
https://wikitech.wikimedia.org/wiki/Heterogeneous_deployment/Train_deploys#Setup
[16:28:52] (PS2) Vgutierrez: authdns: Remove baham from the authdns host list [puppet] - https://gerrit.wikimedia.org/r/444875 (https://phabricator.wikimedia.org/T196664)
[16:29:22] zeljkof: sure, I think it must've been copy and paste from some other instructions somewhere
[16:29:43] (PS1) Giuseppe Lavagetto: profile::mediawiki::web_testing: create profile [puppet] - https://gerrit.wikimedia.org/r/444908
[16:29:45] (PS1) Giuseppe Lavagetto: apache-fast-test: read files from the tests directory as a fallback [puppet] - https://gerrit.wikimedia.org/r/444909
[16:29:47] (PS1) Giuseppe Lavagetto: profile::mediawiki::web_testing: add script for deploying apache changes [puppet] - https://gerrit.wikimedia.org/r/444910
[16:30:10] (PS3) Arturo Borrero Gonzalez: install_server: partman: refresh labvirt-ssd partman recipe [puppet] - https://gerrit.wikimedia.org/r/444837
[16:33:34] thcipriani: makedeploynotes.py complains with `ModuleNotFoundError: No module named 'requests'`
[16:33:46] I guess I have to install dependencies?
[16:34:23] yeah, python3-requests
[16:34:42] although that seems to be installed on deploy1001
[16:34:53] ah, ran it on my machine
[16:34:57] ok, running there :)
[16:35:21] Operations, ops-eqiad, DBA: db1069 bad disk - https://phabricator.wikimedia.org/T199056 (Marostegui) Open>Resolved All good this time ``` root@db1069:~# megacli -LDPDInfo -aAll Adapter #0 Number of Virtual Disks: 1 Virtual Drive: 0 (Target Id: 0) Name : RAID Level :...
[16:36:44] runs fine there
[16:36:55] nice
[16:37:56] thcipriani: it's magic!
https://www.mediawiki.org/wiki/MediaWiki_1.32/wmf.12/Changelog
[16:38:24] baham has been replaced succesfully with authdns2001, merges to operations/dns can be done as usual
[16:38:32] \o/
[16:40:07] thcipriani: uh oh, so I should have done all this _before_ the window :) ok, finally at https://wikitech.wikimedia.org/wiki/Heterogeneous_deployment/Train_deploys#Switch_group0_wikis_to_VERSION
[16:40:52] all if it can be done outside the window. I generally do the branch cut earlier, but, in general, I do everything else inside the window.
[16:41:08] (CR) Zfilipin: [C: 2] Group0 to 1.32.0-wmf.12 [mediawiki-config] - https://gerrit.wikimedia.org/r/444887 (owner: Zfilipin)
[16:41:24] sounds like fun first train zeljkof :D
[16:41:34] addshore: oh yeah! :D
[16:41:55] I remember reading the docs a few weeks back and being glad I didn't have to do it ;)
[16:42:16] any estimate how long the remaining train will take? :)
[16:42:26] * bd808 remembers writing the first public docs on a train deploy and being terrified
[16:42:29] addshore: read the docs now, I've been working on them all day, haven't been in a better shape in a long time ;)
[16:42:30] (Merged) jenkins-bot: Group0 to 1.32.0-wmf.12 [mediawiki-config] - https://gerrit.wikimedia.org/r/444887 (owner: Zfilipin)
[16:42:43] Lucas_WMDE: no :D
[16:42:48] (CR) jenkins-bot: Group0 to 1.32.0-wmf.12 [mediawiki-config] - https://gerrit.wikimedia.org/r/444887 (owner: Zfilipin)
[16:42:48] ok :D
[16:42:51] estimate: two hours ago
[16:42:58] obviously wrong
[16:43:00] well, I’ll just wish you good luck then
[16:43:07] thanks :D will need it
[16:43:07] and see how it goes
[16:43:16] the fist train ever
[16:43:22] (for me, that is)
[16:44:18] bd808: there are still a few things that raise my hair :D but thanks to thcipriani there is one less as of a few minutes ago
[16:45:47] (PS1) Rush: dumps: add labstore1006 back for dumps serving to cloud [puppet] - https://gerrit.wikimedia.org/r/444914
(https://phabricator.wikimedia.org/T198420)
[16:47:02] Operations, DBA, JADE, Scoring-platform-team (Current), User-Joe: Extension:JADE scalability concerns due to creating a page per revision - https://phabricator.wikimedia.org/T196547 (mark)
[16:47:17] Operations, monitoring, User-fgiunchedi: Handle SMART for multiple shelves attached to a single smartarray controller - https://phabricator.wikimedia.org/T199236 (Bstorm)
[16:47:32] (PS2) Rush: dumps: add labstore1006 back for dumps serving to cloud [puppet] - https://gerrit.wikimedia.org/r/444914 (https://phabricator.wikimedia.org/T198420)
[16:48:11] Operations, ops-eqiad, Cloud-VPS, Datasets-General-or-Unknown: rack upgraded storage capacity in labstore100[67].eqiad.wmnet - https://phabricator.wikimedia.org/T196651 (Bstorm)
[16:48:23] (CR) Rush: [C: 2] dumps: add labstore1006 back for dumps serving to cloud [puppet] - https://gerrit.wikimedia.org/r/444914 (https://phabricator.wikimedia.org/T198420) (owner: Rush)
[16:51:47] !log zfilipin@deploy1001 rebuilt and synchronized wikiversions files: group0 to 1.32.0-wmf.12
[16:51:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:52:45] Operations, Research, SRE-Access-Requests: Request access to data for citation usage research - https://phabricator.wikimedia.org/T198662 (Miriam) @DarTar could sign off here to give @Pirroh access to the data? (See task description) Thanks!
[17:00:04] cscott, arlolra, subbu, halfak, and Amir1: Dear deployers, time to do the Services – Graphoid / Parsoid / Citoid / ORES deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180710T1700).
[17:06:02] thcipriani: I'm... I'm done...?
https://www.mediawiki.org/w/index.php?diff=2825343&oldid=2819773&title=MediaWiki_1.32/Roadmap&type=revision&diffmode=source
[17:06:48] looks like it: https://www.mediawiki.org/wiki/Special:Version
[17:06:55] kudos on the first train :)
[17:07:45] thcipriani: party time! :) thanks for all the help! I think it's time for an adult beverage
[17:08:39] Operations, Traffic, Wikimedia-General-or-Unknown: Expire cache after Wikipedia copyright protests - https://phabricator.wikimedia.org/T199252 (Nemo_bis)
[17:08:52] Operations, Traffic, Wikimedia-General-or-Unknown: Expire cache after Wikipedia/Wikimedia copyright protests - https://phabricator.wikimedia.org/T199252 (Nemo_bis)
[17:11:27] Operations, Traffic, Wikimedia-General-or-Unknown: Expire cache after Wikipedia/Wikimedia copyright protests - https://phabricator.wikimedia.org/T199252 (BBlack) wgCacheEpoch is probably about the parser cache, which is separate from #Traffic 's Varnish caching. Either one could be an issue here, or...
[17:19:28] Operations, Traffic, Wikimedia-General-or-Unknown: Expire cache after Wikipedia/Wikimedia copyright protests - https://phabricator.wikimedia.org/T199252 (matmarex) The redirects (that I know of) were implemented using JavaScript code in MediaWiki:Common.js etc, for example: * https://it.wikipedia.org...
[17:20:02] (PS1) Reedy: Make the 'affcomusergroup' require users to be logged in to send it [mediawiki-config] - https://gerrit.wikimedia.org/r/444918
[17:28:15] Operations, Discovery-Search (Current work): migrate elasticsearch to stretch (from jessie) - https://phabricator.wikimedia.org/T193649 (Gehel) testing in progress on deployment-prep
[17:40:00] (PS2) Reedy: Make the 'affcomusergroup' require users to be logged in to send it [mediawiki-config] - https://gerrit.wikimedia.org/r/444918
[17:54:55] (PS1) Ottomata: Set contact_group to admins for main MirrorMaker alerts [puppet] - https://gerrit.wikimedia.org/r/444925
[17:58:37] Operations, Traffic, Wikimedia-General-or-Unknown: Expire cache after Wikipedia/Wikimedia copyright protests - https://phabricator.wikimedia.org/T199252 (Daimona) Several users have reported this problem, however they weren't really redirected: instead, while searching stuff on google, //google redir...
[18:03:45] Operations: Decommission servermon - https://phabricator.wikimedia.org/T198939 (faidon) I'm using servermon for fact query regularly, but I think I'm one of the very few :) I admit I haven't played around much with puppetboard to adjust my use cases, so that may be something that could potentially work (with...
[18:12:21] (CR) Anomie: "This either needs to wait for Thursday after the train (SCHEMA_COMPAT_OLD is added in wmf.12), or it needs to use MIGRATION_OLD like the c" [mediawiki-config] - https://gerrit.wikimedia.org/r/440128 (https://phabricator.wikimedia.org/T174044) (owner: Addshore)
[18:13:46] (CR) Alex Monk: [WIP] get rid of openssl CLI usage (1 comment) [software/certcentral] - https://gerrit.wikimedia.org/r/444631 (owner: Vgutierrez)
[18:17:54] (PS3) RobH: DNS: Add DNS asset tag mgmt for spare servers [dns] - https://gerrit.wikimedia.org/r/441063 (https://phabricator.wikimedia.org/T196666) (owner: Papaul)
[18:18:12] Operations, Performance-Team, Traffic, Wikimedia-General-or-Unknown: Expire cache after Wikipedia/Wikimedia copyright protests - https://phabricator.wikimedia.org/T199252 (Daimona) I think we already purged common.js several times after the blackout, anyway let's see if it works. As for Google, I...
[18:18:22] (PS1) Aaron Schulz: Make all non-test wikis write to both nutcracker and mcrouter again [mediawiki-config] - https://gerrit.wikimedia.org/r/444932
[18:19:45] (CR) RobH: [C: 2] DNS: Add DNS asset tag mgmt for spare servers [dns] - https://gerrit.wikimedia.org/r/441063 (https://phabricator.wikimedia.org/T196666) (owner: Papaul)
[18:22:04] (PS2) Aaron Schulz: Make all non-test wikis write to both nutcracker and mcrouter again [mediawiki-config] - https://gerrit.wikimedia.org/r/444932 (https://phabricator.wikimedia.org/T198239)
[18:22:59] (CR) Daniel Kinzler: [C: -1] "Oh, you are right! I would have expected Jenkins to notice this... it should be running the deployed branch, not master, on this repo!" [mediawiki-config] - https://gerrit.wikimedia.org/r/440128 (https://phabricator.wikimedia.org/T174044) (owner: Addshore)
[18:23:07] (CR) Jforrester: [C: -1] "Requires wmf.13 to be live on meta."
[mediawiki-config] - https://gerrit.wikimedia.org/r/444918 (owner: Reedy)
[18:24:07] Reedy: time for another go methinks
[18:24:15] (CR) Reedy: "It's basically a noop without the code that uses it though" [mediawiki-config] - https://gerrit.wikimedia.org/r/444918 (owner: Reedy)
[18:29:04] (CR) Aaron Schulz: [C: 2] Make all non-test wikis write to both nutcracker and mcrouter again [mediawiki-config] - https://gerrit.wikimedia.org/r/444932 (https://phabricator.wikimedia.org/T198239) (owner: Aaron Schulz)
[18:30:40] (Merged) jenkins-bot: Make all non-test wikis write to both nutcracker and mcrouter again [mediawiki-config] - https://gerrit.wikimedia.org/r/444932 (https://phabricator.wikimedia.org/T198239) (owner: Aaron Schulz)
[18:33:49] (CR) Jforrester: [C: -1] "Sure, but we don't want the train to randomly derail because it turns out this code breaks prod in some odd way. :-)" [mediawiki-config] - https://gerrit.wikimedia.org/r/444918 (owner: Reedy)
[18:33:54] !log aaron@deploy1001 Synchronized wmf-config/mc.php: Make all non-test wikis write to both nutcracker and mcrouter again (duration: 00m 57s)
[18:33:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:34:22] Puppet, Beta-Cluster-Infrastructure, Patch-For-Review: Set up puppet exported resources to collect ssh host keys for beta - https://phabricator.wikimedia.org/T72792 (Krenair) cleaned that up today per Krenair: I finally got a chance to note the config info for the old dumps puppetmaster in...
[18:34:23] (03PS5) 10Alex Monk: Allow PuppetDB use on standalone puppetmasters [puppet] - 10https://gerrit.wikimedia.org/r/435631 (https://phabricator.wikimedia.org/T72792) [18:35:55] (03CR) 10jenkins-bot: Make all non-test wikis write to both nutcracker and mcrouter again [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444932 (https://phabricator.wikimedia.org/T198239) (owner: 10Aaron Schulz) [18:38:03] (03CR) 10Krinkle: "Indeed, mediawiki/includes/Setup.php:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444665 (owner: 10Jforrester) [18:42:33] 10Operations, 10ops-eqsin, 10Traffic: cp5006 unresponsive - https://phabricator.wikimedia.org/T187157 (10RobH) After discussion with @Cmjohnson its been decided we'll go ahead and attempt to get the mainboard replaced before doing the smarthands work i suggested above. @papaul was onsite and did the steps:... [18:50:02] (03PS2) 10Jforrester: Cleanup: Stop trying to set wgLocalTZOffset, it's wgLocalTZ*o*ffset [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444665 [18:51:37] 10Operations, 10ops-eqsin, 10Traffic: cp5006 unresponsive - https://phabricator.wikimedia.org/T187157 (10RobH) {F23557620} Self dispatch SR971650695 scheduled, including a request for an onsite technician. Once they send me the shipping info, I'll open an inbound shipment ticket with eqsin. I'll then sch... [18:51:51] 10Operations, 10Performance-Team, 10Traffic, 10Wikimedia-General-or-Unknown: Expire cache after Wikipedia/Wikimedia copyright protests - https://phabricator.wikimedia.org/T199252 (10Krinkle) >>! In T199252#4413493, @Daimona wrote: > As for Google, I don't have a link but I can ask for it if you want. What... 
[18:53:33] (03CR) 10Krinkle: [C: 031] Cleanup: Stop trying to set wgLocalTZOffset, it's wgLocalTZ*o*ffset [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444665 (owner: 10Jforrester) [18:56:32] (03PS1) 10Krinkle: Clean up wgLocaltimezone (1/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444940 [18:56:34] (03PS1) 10Krinkle: Clean up wgLocaltimezone (2/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444941 [19:00:04] Deploy window MediaWiki train - Americas version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180710T1900) [19:00:21] 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/add to spares tracking 2 single cpu misc class systems - https://phabricator.wikimedia.org/T196666 (10RobH) [19:00:47] 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/add to spares tracking 2 single cpu misc class systems - https://phabricator.wikimedia.org/T196666 (10RobH) 05Open>03Resolved >>! In T196666#4300081, @Papaul wrote: > Switch port information : > both servers are racked in D8 > wmf6652 ge-8/0/3 >... 
[19:07:15] (03CR) 10Jforrester: [C: 031] Clean up wgLocaltimezone (1/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444940 (owner: 10Krinkle) [19:07:24] (03CR) 10Jforrester: [C: 031] Clean up wgLocaltimezone (2/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444941 (owner: 10Krinkle) [19:16:28] PROBLEM - High CPU load on API appserver on mw1227 is CRITICAL: CRITICAL - load average: 50.49, 31.62, 23.18 [19:18:37] (03PS1) 10Krinkle: Remove unused $tmarray variables [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444946 (https://phabricator.wikimedia.org/T189966) [19:26:01] (03CR) 1020after4: [C: 032] Scap: UpdateInterwikiCache fix subclassing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/441571 (https://phabricator.wikimedia.org/T196642) (owner: 10Thcipriani) [19:27:42] (03Merged) 10jenkins-bot: Scap: UpdateInterwikiCache fix subclassing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/441571 (https://phabricator.wikimedia.org/T196642) (owner: 10Thcipriani) [19:28:03] (03CR) 10jenkins-bot: Scap: UpdateInterwikiCache fix subclassing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/441571 (https://phabricator.wikimedia.org/T196642) (owner: 10Thcipriani) [19:34:06] (03CR) 1020after4: [C: 031] "Can I get someone from SRE to merge this? It's affecting some users' ability to get work done." 
[puppet] - 10https://gerrit.wikimedia.org/r/444810 (https://phabricator.wikimedia.org/T198974) (owner: 1020after4) [19:34:17] (03PS2) 1020after4: Phabricator: Double the rate limit and connection limit [puppet] - 10https://gerrit.wikimedia.org/r/444810 (https://phabricator.wikimedia.org/T198974) [19:34:45] (03PS1) 10Andrew Bogott: Neutron: define dns_domain in config [puppet] - 10https://gerrit.wikimedia.org/r/444947 [19:35:23] (03CR) 10Paladox: [C: 031] Phabricator: Double the rate limit and connection limit [puppet] - 10https://gerrit.wikimedia.org/r/444810 (https://phabricator.wikimedia.org/T198974) (owner: 1020after4) [19:35:33] (03CR) 10jerkins-bot: [V: 04-1] Neutron: define dns_domain in config [puppet] - 10https://gerrit.wikimedia.org/r/444947 (owner: 10Andrew Bogott) [19:37:08] (03CR) 1020after4: [C: 031] "https://puppet-compiler.wmflabs.org/compiler03/11762/" [puppet] - 10https://gerrit.wikimedia.org/r/444810 (https://phabricator.wikimedia.org/T198974) (owner: 1020after4) [19:38:48] PROBLEM - High CPU load on API appserver on mw1233 is CRITICAL: CRITICAL - load average: 52.28, 35.71, 30.45 [19:39:47] PROBLEM - High CPU load on API appserver on mw1227 is CRITICAL: CRITICAL - load average: 53.50, 37.23, 30.35 [19:40:48] PROBLEM - High CPU load on API appserver on mw1226 is CRITICAL: CRITICAL - load average: 58.53, 38.14, 27.75 [19:40:54] (03PS2) 10Andrew Bogott: Neutron: define dns_domain in config [puppet] - 10https://gerrit.wikimedia.org/r/444947 [19:42:13] 10Operations, 10Performance-Team, 10Traffic, 10Wikimedia-General-or-Unknown: Expire cache after Wikipedia/Wikimedia copyright protests - https://phabricator.wikimedia.org/T199252 (10BBlack) Was the "temporary" JS redirect a 301 perhaps? 
[19:43:40] (03CR) 10Framawiki: [C: 031] Create Publisher namespace in Bengali Wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444664 (https://phabricator.wikimedia.org/T199028) (owner: 10Zoranzoki21) [19:44:55] (03PS3) 10Andrew Bogott: Neutron: define dns_domain in config [puppet] - 10https://gerrit.wikimedia.org/r/444947 [19:46:41] (03CR) 10Andrew Bogott: [C: 032] Neutron: define dns_domain in config [puppet] - 10https://gerrit.wikimedia.org/r/444947 (owner: 10Andrew Bogott) [19:48:30] 10Operations, 10Performance-Team, 10Traffic, 10Wikimedia-General-or-Unknown: Expire cache after Wikipedia/Wikimedia copyright protests - https://phabricator.wikimedia.org/T199252 (10Krinkle) >>! In T199252#4413739, @BBlack wrote: > Was the "temporary" JS redirect a 301 perhaps? Nope, it wasn't any form of... [19:50:38] PROBLEM - High CPU load on API appserver on mw1226 is CRITICAL: CRITICAL - load average: 46.74, 37.57, 32.67 [19:51:18] PROBLEM - High CPU load on API appserver on mw1234 is CRITICAL: CRITICAL - load average: 49.94, 30.21, 22.71 [19:52:27] PROBLEM - High CPU load on API appserver on mw1231 is CRITICAL: CRITICAL - load average: 48.63, 33.39, 25.94 [19:52:57] PROBLEM - HHVM rendering on mw1230 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:53:08] PROBLEM - High CPU load on API appserver on mw1222 is CRITICAL: CRITICAL - load average: 48.58, 30.07, 21.50 [19:53:27] PROBLEM - High CPU load on API appserver on mw1225 is CRITICAL: CRITICAL - load average: 54.44, 30.74, 20.24 [19:53:48] RECOVERY - HHVM rendering on mw1230 is OK: HTTP OK: HTTP/1.1 200 OK - 73926 bytes in 0.468 second response time [19:54:58] PROBLEM - High CPU load on API appserver on mw1229 is CRITICAL: CRITICAL - load average: 44.02, 36.55, 28.53 [19:55:08] PROBLEM - High CPU load on API appserver on mw1230 is CRITICAL: CRITICAL - load average: 54.10, 35.86, 23.95 [19:55:18] PROBLEM - High CPU load on API appserver on mw1222 is CRITICAL: CRITICAL - load average: 53.23, 36.52, 
24.95 [19:55:27] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [19:55:38] PROBLEM - High CPU load on API appserver on mw1225 is CRITICAL: CRITICAL - load average: 49.35, 34.98, 23.04 [19:55:47] PROBLEM - High CPU load on API appserver on mw1234 is CRITICAL: CRITICAL - load average: 54.72, 39.35, 28.24 [19:56:07] PROBLEM - High CPU load on API appserver on mw1232 is CRITICAL: CRITICAL - load average: 56.61, 40.33, 29.37 [19:56:38] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [19:56:47] PROBLEM - HTTP availability for Varnish at eqiad on einsteinium is CRITICAL: job=varnish-text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [19:56:48] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on einsteinium is CRITICAL: cluster=cache_text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [19:56:57] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on einsteinium is CRITICAL: cluster=cache_text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [19:56:57] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on einsteinium is CRITICAL: cluster=cache_text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [19:57:08] PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on einsteinium is CRITICAL: 
cluster=cache_text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [19:57:27] PROBLEM - HTTP availability for Varnish at ulsfo on einsteinium is CRITICAL: job=varnish-text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [19:57:47] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [19:57:48] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=codfw&var-cache_type=All&var-status_type=5 [19:57:58] PROBLEM - HTTP availability for Varnish at codfw on einsteinium is CRITICAL: job=varnish-text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [19:58:17] PROBLEM - HTTP availability for Varnish at esams on einsteinium is CRITICAL: job=varnish-text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [19:58:28] PROBLEM - HTTP availability for Varnish at eqsin on einsteinium is CRITICAL: job=varnish-text site=eqsin https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [19:58:28] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqsin on einsteinium is CRITICAL: cluster=cache_text site=eqsin https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [19:59:17] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] 
https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5 [20:00:07] RECOVERY - High CPU load on API appserver on mw1225 is OK: OK - load average: 14.11, 24.69, 21.98 [20:01:17] PROBLEM - Eqsin HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=eqsin&var-cache_type=All&var-status_type=5 [20:01:32] well that's rather loud [20:01:40] I wonder if it's important [20:01:48] RECOVERY - HTTP availability for Varnish at eqsin on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [20:01:48] RECOVERY - HTTP availability for Varnish at ulsfo on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [20:01:48] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqsin on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [20:01:48] RECOVERY - High CPU load on API appserver on mw1230 is OK: OK - load average: 15.28, 24.78, 23.66 [20:02:05] Krinkle, any idea what's up? [20:02:20] I do not know. [20:02:27] RECOVERY - HTTP availability for Varnish at codfw on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [20:02:29] Checking Logstash [20:02:47] RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on einsteinium is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [20:02:47] RECOVERY - HTTP availability for Varnish at esams on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [20:03:19] https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 looks like there was a brief problem [20:03:27] RECOVERY - HTTP availability for Varnish at eqiad on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [20:03:27] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [20:03:37] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [20:03:37] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [20:03:40] Indeed. 
[20:03:45] Edit count also plummeted momentarily [20:03:46] https://grafana.wikimedia.org/dashboard/db/edit-count?orgId=1&from=1531239268571&to=1531253015803 [20:03:56] 10Operations, 10Cloud-Services, 10Patch-For-Review, 10cloud-services-team (Kanban): rebuild tools-grid-master as a large instance - https://phabricator.wikimedia.org/T162955 (10Bstorm) [20:04:17] RECOVERY - High CPU load on API appserver on mw1222 is OK: OK - load average: 14.58, 22.44, 23.45 [20:04:18] save timing went up significantly (4X) - https://grafana.wikimedia.org/dashboard/db/save-timing?orgId=1&from=1531247940501&to=1531253040967 [20:05:01] Error logs show a significant ERROR increase [20:05:02] https://grafana.wikimedia.org/dashboard/db/production-logging?orgId=1&from=1531227634440&to=1531252968550 [20:05:15] Sustained for about 2 hours [20:08:57] RECOVERY - Codfw HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=codfw&var-cache_type=All&var-status_type=5 [20:09:58] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [20:10:04] It seems in Logstash, the majority of errors are: [20:10:06] > Memcached error for key "{memcached-key}" on server "{memcached-server}": SERVER ERROR [20:10:08] RECOVERY - Eqsin HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=eqsin&var-cache_type=All&var-status_type=5 [20:10:18] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] 
https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5 [20:10:18] 4,700 hits in 15min [20:10:37] RECOVERY - High CPU load on API appserver on mw1232 is OK: OK - load average: 13.81, 19.00, 23.96 [20:10:58] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [20:11:07] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [20:11:17] RECOVERY - High CPU load on API appserver on mw1234 is OK: OK - load average: 11.87, 17.54, 23.91 [20:11:29] also lots of [20:11:30] > Memcached error for key "{memcached-key}" on server "{memcached-server}": SERVER HAS FAILED AND IS DISABLED UNTIL TIMED RETRY [20:11:53] AaronSchulz: _joe_ : moritzm [20:12:38] RECOVERY - High CPU load on API appserver on mw1229 is OK: OK - load average: 13.96, 17.73, 23.99 [20:13:57] Krinkle: some kind of timeout issue? Odd. [20:14:04] The top key in logstash/memcached is 'wikibase_shared/1_32_0-wmf_10-wikidatawiki-hhvm:CacheAwarePropertyInfoStore' [20:14:17] Which follows the shape of the overall error burst exactly [20:15:38] RECOVERY - High CPU load on API appserver on mw1231 is OK: OK - load average: 10.82, 14.76, 23.67 [20:18:50] AaronSchulz: So the mcrouter write is now live for all wikis, right? [20:19:27] RECOVERY - High CPU load on API appserver on mw1226 is OK: OK - load average: 13.82, 15.58, 23.78 [20:20:03] It's still affecting snapshot100x hosts a lot. It used to have ~0 errors in channel:memcached, now 1000 per 5min and counting. 
[20:20:30] AaronSchulz: https://logstash.wikimedia.org/goto/a610950920965b1bb57e0c50a7130cc3 [20:20:50] yeah, the snapshot1001 thing is weird [20:21:18] Looking at all other servers only, it seems to have recovered. [20:21:34] > https://logstash.wikimedia.org/goto/85796fdc0d11e4bc3636650f7e18928e [20:24:24] (03CR) 10Jdlrobson: [C: 031] Enable page previews for all new editors (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444906 (https://phabricator.wikimedia.org/T197719) (owner: 10Pmiazga) [20:28:08] RECOVERY - High CPU load on API appserver on mw1227 is OK: OK - load average: 12.35, 14.20, 23.88 [20:30:25] Krinkle: I see that https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/442813/ is still not in wmf10 [20:31:07] I wonder if some lock()/add() thing caused some important value to never be updated and expire or something [20:32:54] indeed [20:32:54] Krinkle: https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/445005/ [20:33:14] lock/add() just fails deterministically every time for some keys. [20:34:26] the new version is in wmf12 at least (which is running in group0) [20:35:42] AaronSchulz: Yeah [20:35:53] AaronSchulz: Are there any other objectcache related patches not in wmf12 yet? [20:35:55] wmf10 * [20:35:59] Or was this the only one? 
[20:36:16] Just thinking whether we should do more at the same time and/or rollback until wmf12 is everywhere [20:36:20] just those two (the first one I already backported) [20:36:24] OK [20:36:26] let's do it [20:36:29] (that was the encoding one) [20:37:17] RECOVERY - High CPU load on API appserver on mw1233 is OK: OK - load average: 17.35, 18.15, 23.82 [20:42:33] (03PS3) 10Mforns: Add HDFS whitelist path to EventLoggingSanitization job [puppet] - 10https://gerrit.wikimedia.org/r/442121 (https://phabricator.wikimedia.org/T193176) [20:43:01] (03PS4) 10Mforns: Add HDFS whitelist path to EventLoggingSanitization job [puppet] - 10https://gerrit.wikimedia.org/r/442121 (https://phabricator.wikimedia.org/T193176) [20:45:03] * AaronSchulz wonders if there are any projects to make jenkins faster [20:45:30] (03CR) 10Mforns: "This should be good to go. Also, the backfilling went up to 7th of July, so when we merge this, it will catch up automatically since then." [puppet] - 10https://gerrit.wikimedia.org/r/442121 (https://phabricator.wikimedia.org/T193176) (owner: 10Mforns) [20:48:42] !log aaron@deploy1001 Synchronized php-1.32.0-wmf.10/includes/libs/objectcache/MultiWriteBagOStuff.php: 4fba9f6a032 (duration: 00m 57s) [20:48:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:50:40] !log aaron@deploy1001 Synchronized php-1.32.0-wmf.10/tests/phpunit/includes/libs/objectcache/MultiWriteBagOStuffTest.php: 4fba9f6a032 (duration: 00m 56s) [20:50:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:03:48] 10Operations, 10Cloud-VPS, 10procurement: rack/setup/install cloudvirt102[34] - https://phabricator.wikimedia.org/T199125 (10chasemp) a:05chasemp>03Andrew [21:04:33] !log re-configure GTT circuit in eqiad/knams [21:04:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:06:36] (continuing in -perf) [21:07:07] (03PS1) 10Rush: openstack: eqiad1-r metadata agent for net role [puppet] - 
10https://gerrit.wikimedia.org/r/445020 (https://phabricator.wikimedia.org/T196633) [21:08:28] (03CR) 10Rush: [C: 032] openstack: eqiad1-r metadata agent for net role [puppet] - 10https://gerrit.wikimedia.org/r/445020 (https://phabricator.wikimedia.org/T196633) (owner: 10Rush) [21:12:51] (03PS1) 10Andrew Bogott: vmbuilder and bootstrapvz: get hostname from metadata [puppet] - 10https://gerrit.wikimedia.org/r/445021 [21:13:44] (03CR) 10Andrew Bogott: [C: 032] vmbuilder and bootstrapvz: get hostname from metadata [puppet] - 10https://gerrit.wikimedia.org/r/445021 (owner: 10Andrew Bogott) [21:16:04] (03PS1) 10Rush: openstack: eqiad1 deployment net role notes [puppet] - 10https://gerrit.wikimedia.org/r/445022 [21:18:54] (03PS1) 10Rush: openstack: add labnet100[34] VLAN 1120 reservations [dns] - 10https://gerrit.wikimedia.org/r/445023 (https://phabricator.wikimedia.org/T196633) [21:24:37] PROBLEM - nutcracker process on scb2001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [21:24:37] PROBLEM - citoid endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
[21:24:38] PROBLEM - apertium apy on scb2001 is CRITICAL: HTTP CRITICAL - No data received from host [21:24:58] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/feed/announcements (Retrieve announcements) timed out before a response was received: /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) timed out before a response was received: /_info (retrieve service info) timed out before a response was received: /{domain}/v1/page/news (get In the New [21:24:58] ported language (with aggregated=true)) timed out before a response was received [21:25:28] RECOVERY - nutcracker process on scb2001 is OK: PROCS OK: 1 process with UID = 111 (nutcracker), command name nutcracker [21:25:37] RECOVERY - citoid endpoints health on scb2001 is OK: All endpoints are healthy [21:25:47] RECOVERY - apertium apy on scb2001 is OK: HTTP OK: HTTP/1.1 200 OK - 5996 bytes in 0.076 second response time [21:25:58] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy [21:26:28] (03PS2) 10Rush: openstack: eqiad1 deployment net role notes [puppet] - 10https://gerrit.wikimedia.org/r/445022 [21:27:42] (03CR) 10Rush: [C: 032] openstack: eqiad1 deployment net role notes [puppet] - 10https://gerrit.wikimedia.org/r/445022 (owner: 10Rush) [21:31:13] AaronSchulz: Krinkle y'all OK? [21:32:17] greg-g: Yeah, seems the mcrouter deployment for all wikis went ahead of a critical fix for the memc_add() function, which we knew about and fixed last week, but forgot to backport given the train is offset by one week from the original schedule. [21:32:30] Caused a cascading failure we'll write up later, but for now, things are fine. [21:32:37] I've shared a preliminary write-up with you. [21:33:41] Krinkle: ack, thanks. 
[21:40:03] (03PS1) 10Reedy: Stop logging email changes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445030 [21:40:09] !log reboot labnet100[34] [21:40:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:56:34] (03PS2) 10Catrope: Rollout Watchlist Structured Filters to all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440642 (https://phabricator.wikimedia.org/T181193) (owner: 10Mooeypoo) [22:13:22] (03PS1) 10Thcipriani: Scap: Bump version to 3.8.4 [puppet] - 10https://gerrit.wikimedia.org/r/445031 (https://phabricator.wikimedia.org/T199283) [22:14:48] 10Operations, 10Scap, 10Patch-For-Review, 10Release-Engineering-Team (Kanban), 10Services (blocked): Update Debian Package for Scap3 to 3.8.4-1 - https://phabricator.wikimedia.org/T199283 (10thcipriani) Adding #Operations for the package update + puppet patch. [22:21:47] PROBLEM - Check systemd state on ms-be1024 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [22:22:01] jouncebot: next [22:22:02] In 0 hour(s) and 37 minute(s): Evening SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180710T2300) [22:48:17] PROBLEM - Check systemd state on ms-be1039 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [22:56:00] (03PS1) 10Jdlrobson: Scrub ambox images [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445036 [23:00:04] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor I � Unicode. All rise for Evening SWAT (Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180710T2300). [23:00:04] No GERRIT patches in the queue for this window AFAICS. [23:07:31] (03CR) 10Pmiazga: "Jdlrobson - that's correct, but as Olga said, we want that as a default behavior no matter what. 
Popups are visible to anons by default, i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444906 (https://phabricator.wikimedia.org/T197719) (owner: 10Pmiazga) [23:40:47] (03CR) 10Jdlrobson: [C: 031] "I read Olga's comment to mean all wikis that have the feature enabled already. Note we haven't spoken to any of the projects that don't ha" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444906 (https://phabricator.wikimedia.org/T197719) (owner: 10Pmiazga)