[00:00:05] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor My software never has bugs. It just develops random features. Rise for Evening SWAT (Max 8 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171117T0000). [00:00:05] kaldari: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [00:05:17] greg-g: are swats frozen too? [00:05:27] Is SWAT deploy happening? [00:05:36] what's the patch? [00:05:52] just a config change... [00:06:01] https://gerrit.wikimedia.org/r/#/c/390131/ [00:06:03] that's OK [00:06:12] cool [00:06:19] I know there's a timeline for that, and it won't affect anything on the affected database [00:06:28] I'll deploy [00:06:30] thanks! [00:07:10] MaxSem: Before you deploy that... [00:07:23] (03PS3) 10MaxSem: Create new MP3 Uploaders group on Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390131 (https://phabricator.wikimedia.org/T180002) (owner: 10Kaldari) [00:07:25] (03CR) 10MaxSem: [C: 032] Create new MP3 Uploaders group on Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390131 (https://phabricator.wikimedia.org/T180002) (owner: 10Kaldari) [00:08:25] (03Merged) 10jenkins-bot: Create new MP3 Uploaders group on Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390131 (https://phabricator.wikimedia.org/T180002) (owner: 10Kaldari) [00:08:37] (03CR) 10jenkins-bot: Create new MP3 Uploaders group on Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390131 (https://phabricator.wikimedia.org/T180002) (owner: 10Kaldari) [00:08:47] MaxSem: The config change depends on a change on the wmf.8 train [00:09:03] waah? [00:09:04] RECOVERY - Host mw2251 is UP: PING OK - Packet loss = 0%, RTA = 36.24 ms [00:09:05] what? 
[00:09:06] which I guess didn't roll to en.wiki, but maybe is on Commons? [00:09:12] it's on commons, yeah [00:09:17] https://tools.wmflabs.org/versions/ [00:09:42] MaxSem: Oh cool, it's fine then [00:10:23] (03CR) 10Dzahn: "It's better to just have a single module than 2 of them. If we can use the mariadb module for everything that would be ideal. So i'm all w" [puppet] - 10https://gerrit.wikimedia.org/r/391849 (https://phabricator.wikimedia.org/T162070) (owner: 10Jcrespo) [00:11:05] PROBLEM - HHVM rendering on mw2251 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:11:55] RECOVERY - HHVM rendering on mw2251 is OK: HTTP OK: HTTP/1.1 200 OK - 79170 bytes in 0.283 second response time [00:13:49] kaldari: live on mwdebug1002 [00:14:23] MaxSem: Anything else after this patch? If not, I'd like to take a moment after this to roll out https://gerrit.wikimedia.org/r/#/c/391162/ carefully. [00:14:53] nothing, I'll ping you [00:16:08] Thx [00:16:09] Krinkle: be careful please :) :) [00:16:31] 10Operations, 10ops-codfw: mw2251 hardware error - https://phabricator.wikimedia.org/T180724#3768664 (10Papaul) Full memory test came out with no errors. I went ahead and updated the IDRAC firmware as well. The system is back up online. [00:21:42] MaxSem: Good to go. 
[00:22:56] !log maxsem@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/390131/3 (duration: 00m 49s) [00:22:59] !log Decommissioning Cassandra, restbase1014-c.eqiad.wmnet (T179422) [00:23:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:23:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:23:07] T179422: Reshape RESTBase Cassandra clusters - https://phabricator.wikimedia.org/T179422 [00:23:55] (03PS8) 10BBlack: normalize_path: fully normalize MW+RB URL paths [puppet] - 10https://gerrit.wikimedia.org/r/391216 (https://phabricator.wikimedia.org/T127387) [00:26:33] Krinkle: all yours [00:27:55] (03PS4) 10Lokal Profil: Support prefixed dump types [dumps/dcat] - 10https://gerrit.wikimedia.org/r/390312 (https://phabricator.wikimedia.org/T163328) [00:28:57] (03CR) 10Lokal Profil: "recheck" [dumps/dcat] - 10https://gerrit.wikimedia.org/r/390312 (https://phabricator.wikimedia.org/T163328) (owner: 10Lokal Profil) [00:29:06] (03CR) 10Krinkle: [C: 032] Split profile.php from StartProfiler, and create PhpAutoPrepend.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391162 (https://phabricator.wikimedia.org/T180183) (owner: 10Krinkle) [00:29:16] MaxSem: ack [00:30:21] (03Merged) 10jenkins-bot: Split profile.php from StartProfiler, and create PhpAutoPrepend.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391162 (https://phabricator.wikimedia.org/T180183) (owner: 10Krinkle) [00:30:35] (03CR) 10jenkins-bot: Split profile.php from StartProfiler, and create PhpAutoPrepend.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391162 (https://phabricator.wikimedia.org/T180183) (owner: 10Krinkle) [00:31:52] (03PS1) 10BryanDavis: toolforge: Replace /usr/local/bin/crontab with oge-crontab [puppet] - 10https://gerrit.wikimedia.org/r/391979 (https://phabricator.wikimedia.org/T156174) [00:37:59] !log krinkle@tin Synchronized wmf-config/profiler.php: 
I60cce0eb51101d9e3fe (duration: 00m 48s) [00:38:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:40:02] !log krinkle@tin Synchronized wmf-config/StartProfiler.php: I60cce0eb51101d9e3fe (duration: 00m 49s) [00:40:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:41:04] !log krinkle@tin Synchronized wmf-config/PhpAutoPrepend.php: I60cce0eb51101d9e3fe (duration: 00m 48s) [00:41:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:45:50] (03PS1) 10Dzahn: icinga: test creating individual contact secrets [puppet] - 10https://gerrit.wikimedia.org/r/391980 [00:46:30] (03CR) 10jerkins-bot: [V: 04-1] icinga: test creating individual contact secrets [puppet] - 10https://gerrit.wikimedia.org/r/391980 (owner: 10Dzahn) [00:46:38] (03PS2) 10Dzahn: icinga: test creating individual contact secrets [puppet] - 10https://gerrit.wikimedia.org/r/391980 (https://phabricator.wikimedia.org/T164238) [00:47:16] (03CR) 10jerkins-bot: [V: 04-1] icinga: test creating individual contact secrets [puppet] - 10https://gerrit.wikimedia.org/r/391980 (https://phabricator.wikimedia.org/T164238) (owner: 10Dzahn) [00:57:36] (03PS1) 10Krinkle: labs: Clean up outdated wgProfiler config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391982 (https://phabricator.wikimedia.org/T180766) [00:57:50] (03PS2) 10Krinkle: labs: Clean up outdated wgProfiler config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391982 (https://phabricator.wikimedia.org/T180766) [00:58:05] Done :) [01:02:27] (03PS3) 10Krinkle: labs: Clean up outdated wgProfiler config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391982 (https://phabricator.wikimedia.org/T180766) [01:05:03] (03PS1) 10Krinkle: labs: Enable profiler based on same signals as production does [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391987 (https://phabricator.wikimedia.org/T180766) [01:05:44] PROBLEM - mediawiki-installation DSH group on mw2251 
is CRITICAL: Host mw2251 is not in mediawiki-installation dsh group [01:06:16] (03CR) 10jerkins-bot: [V: 04-1] labs: Enable profiler based on same signals as production does [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391987 (https://phabricator.wikimedia.org/T180766) (owner: 10Krinkle) [01:06:19] (03PS1) 10Dzahn: fake icinga contact secrets for testing [labs/private] - 10https://gerrit.wikimedia.org/r/391988 [01:06:53] (03PS2) 10Krinkle: labs: Enable profiler based on same signals as production does [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391987 (https://phabricator.wikimedia.org/T180766) [01:07:34] (03PS2) 10Dzahn: fake icinga contact secrets for testing [labs/private] - 10https://gerrit.wikimedia.org/r/391988 [01:07:39] (03PS3) 10Dzahn: fake icinga contact secrets for testing [labs/private] - 10https://gerrit.wikimedia.org/r/391988 [01:08:03] (03CR) 10Dzahn: [V: 032 C: 032] fake icinga contact secrets for testing [labs/private] - 10https://gerrit.wikimedia.org/r/391988 (owner: 10Dzahn) [01:08:17] (03CR) 10Dzahn: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/391980 (https://phabricator.wikimedia.org/T164238) (owner: 10Dzahn) [01:08:58] (03CR) 10jerkins-bot: [V: 04-1] icinga: test creating individual contact secrets [puppet] - 10https://gerrit.wikimedia.org/r/391980 (https://phabricator.wikimedia.org/T164238) (owner: 10Dzahn) [02:18:05] PROBLEM - Router interfaces on cr1-eqdfw is CRITICAL: CRITICAL: No response from remote host 208.80.153.198 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 [02:18:54] RECOVERY - Router interfaces on cr1-eqdfw is OK: OK: host 208.80.153.198, interfaces up: 35, down: 0, dormant: 0, excluded: 0, unused: 0 [02:49:27] RECOVERY - MariaDB Slave SQL: s5 on db1099 is OK: OK slave_sql_state Slave_SQL_Running: Yes [03:04:04] PROBLEM - BGP status on cr1-eqdfw is CRITICAL: Use of uninitialized value duration in numeric gt () at /usr/lib/nagios/plugins/check_bgp line 316. 
[03:04:34] RECOVERY - BGP status on cr1-eqdfw is OK: BGP OK - up: 53, down: 2, shutdown: 2 [03:24:45] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 777.65 seconds [03:44:20] (03CR) 10Tim Starling: [C: 031] "Looks good, OK for self-merge" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391987 (https://phabricator.wikimedia.org/T180766) (owner: 10Krinkle) [03:44:56] (03CR) 10Tim Starling: [C: 031] "Looks good, OK for self-merge" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391982 (https://phabricator.wikimedia.org/T180766) (owner: 10Krinkle) [03:59:04] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 288.24 seconds [03:59:44] PROBLEM - BGP status on cr1-eqdfw is CRITICAL: BGP CRITICAL - No response from remote host 208.80.153.198 [04:00:35] RECOVERY - BGP status on cr1-eqdfw is OK: BGP OK - up: 53, down: 2, shutdown: 2 [04:04:17] RECOVERY - MariaDB Slave SQL: s5 on db1106 is OK: OK slave_sql_state Slave_SQL_Running: Yes [04:12:55] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: /{domain}/v1/page/summary/{title}{/revision}{/tid} (Get summary for Barack Obama) timed out before a response was received: /{domain}/v1/page/media/{title} (retrieve images and videos of en.wp Cat page via media route) timed out before a response was received: /{domain}/v1/page/mobile-sections-lead/{title} (retrieve lead section of en.wp Altrincham page via mobile-se [04:12:55] out before a response was received: /{domain}/v1/feed/onthisday/{type}/{mm}/{dd} (retrieve all events on January 15) timed out before a response was received: /{domain}/v1/page/featured/{yyyy}/{mm}/{dd} (retrieve title of the featured article for April 29, 2016) timed out before a response was received: /{domain}/v1/page/definition/{title} (retrieve en-wiktionary definitions for cat) timed out before a response was received [04:15:04] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: 
/{domain}/v1/page/summary/{title}{/revision}{/tid} (Get summary for Barack Obama) timed out before a response was received: /{domain}/v1/page/media/{title} (retrieve images and videos of en.wp Cat page via media route) timed out before a response was received: /{domain}/v1/page/mobile-sections-lead/{title} (retrieve lead section of en.wp Altrincham page via mobile-se [04:15:04] out before a response was received: /{domain}/v1/feed/onthisday/{type}/{mm}/{dd} (retrieve all events on January 15) timed out before a response was received: /{domain}/v1/page/definition/{title} (retrieve en-wiktionary definitions for cat) timed out before a response was received [04:15:24] PROBLEM - cxserver endpoints health on scb1002 is CRITICAL: /v1/page/{language}/{title}{/revision} (Fetch enwiki Oxygen page) timed out before a response was received: /v1/mt/{from}/{to}{/provider} (Machine translate an HTML fragment using Apertium.) timed out before a response was received: /v1/translate/{from}/{to}{/provider} (Machine translate an HTML fragment using Apertium, adapt the links to target language wiki.) timed o [04:15:27] e was received [04:17:04] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy [04:17:25] PROBLEM - cxserver endpoints health on scb1002 is CRITICAL: /v1/mt/{from}/{to}{/provider} (Machine translate an HTML fragment using Apertium.) timed out before a response was received: /v1/translate/{from}/{to}{/provider} (Machine translate an HTML fragment using Apertium, adapt the links to target language wiki.) timed out before a response was received [04:18:24] RECOVERY - cxserver endpoints health on scb1002 is OK: All endpoints are healthy [04:21:19] 10Operations, 10vm-requests, 10Patch-For-Review, 10Performance-Team (Radar): Request VM for webperf (metrics processing) - https://phabricator.wikimedia.org/T179036#3768929 (10faidon) a:03R3609901 What's the status of this? 
[04:33:20] 10Operations, 10vm-requests, 10Patch-For-Review, 10Performance-Team (Radar): Request VM for webperf (metrics processing) - https://phabricator.wikimedia.org/T179036#3768958 (10mmodell) a:05R3609901>03None [04:40:34] (03CR) 10Krinkle: [C: 032] labs: Clean up outdated wgProfiler config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391982 (https://phabricator.wikimedia.org/T180766) (owner: 10Krinkle) [04:40:35] (03CR) 10Krinkle: [C: 032] labs: Enable profiler based on same signals as production does [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391987 (https://phabricator.wikimedia.org/T180766) (owner: 10Krinkle) [04:41:36] (03Merged) 10jenkins-bot: labs: Clean up outdated wgProfiler config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391982 (https://phabricator.wikimedia.org/T180766) (owner: 10Krinkle) [04:41:45] (03Merged) 10jenkins-bot: labs: Enable profiler based on same signals as production does [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391987 (https://phabricator.wikimedia.org/T180766) (owner: 10Krinkle) [04:41:49] (03CR) 10jenkins-bot: labs: Clean up outdated wgProfiler config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391982 (https://phabricator.wikimedia.org/T180766) (owner: 10Krinkle) [04:43:06] !log krinkle@tin Synchronized wmf-config/CommonSettings-labs.php: no-op for labs (duration: 00m 51s) [04:43:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:44:30] (03CR) 10jenkins-bot: labs: Enable profiler based on same signals as production does [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391987 (https://phabricator.wikimedia.org/T180766) (owner: 10Krinkle) [04:48:58] 10Operations, 10DBA, 10Patch-For-Review, 10Wikimedia-Incident: db1063 crashed - https://phabricator.wikimedia.org/T180714#3767261 (10Marostegui) reverts for logging and recentchanges on dewiki and wikidatawiki are finished for most of the hosts (still running for db1071 as its hardware isn't as 
powerful).... [04:50:32] 10Operations, 10DC-Ops: Information missing from racktables - https://phabricator.wikimedia.org/T150651#3768987 (10faidon) a:03RobH We've fixed so many issues over the past few months that I can't even count them :) Thanks all. I did another sweep today and found these that need fixing: **Missing purchase d... [04:52:01] 10Operations, 10DBA, 10Patch-For-Review, 10Wikimedia-Incident: db1063 crashed - https://phabricator.wikimedia.org/T180714#3768990 (10Marostegui) archive done doing ipblocks now [04:52:16] RECOVERY - MariaDB Slave SQL: s5 on db1104 is OK: OK slave_sql_state Slave_SQL_Running: Yes [04:52:24] PROBLEM - HHVM rendering on mw2122 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:53:14] RECOVERY - HHVM rendering on mw2122 is OK: HTTP OK: HTTP/1.1 200 OK - 79196 bytes in 0.312 second response time [04:56:43] 10Operations, 10DBA, 10Patch-For-Review, 10Wikimedia-Incident: db1063 crashed - https://phabricator.wikimedia.org/T180714#3768994 (10Marostegui) ipblocks done filearchive done oldimage done protected_titles done servers are now catching up [04:59:17] 10Operations, 10DBA, 10Patch-For-Review, 10Wikimedia-Incident: db1063 crashed - https://phabricator.wikimedia.org/T180714#3768998 (10Marostegui) Reminder: we have to enable GTID on the slaves. 
[05:07:26] (03PS1) 10Marostegui: db-eqiad.php: Start repooling all the hosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391995 (https://phabricator.wikimedia.org/T180714) [05:07:57] RECOVERY - MariaDB Slave Lag: s5 on db1104 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [05:08:04] PROBLEM - Router interfaces on cr1-eqdfw is CRITICAL: CRITICAL: No response from remote host 208.80.153.198 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 [05:08:07] RECOVERY - MariaDB Slave Lag: s5 on db1106 is OK: OK slave_sql_lag Replication lag: 0.44 seconds [05:08:55] RECOVERY - Router interfaces on cr1-eqdfw is OK: OK: host 208.80.153.198, interfaces up: 35, down: 0, dormant: 0, excluded: 0, unused: 0 [05:09:04] (03PS2) 10Marostegui: db-eqiad.php: Start repooling all the hosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391995 (https://phabricator.wikimedia.org/T180714) [05:09:47] 10Operations, 10DBA, 10Patch-For-Review, 10Wikimedia-Incident: db1063 crashed - https://phabricator.wikimedia.org/T180714#3769005 (10Marostegui) All the hosts (apart from db1071) are now up to date and ready to be pooled. I have done: https://gerrit.wikimedia.org/r/#/c/391995/2 but I will wait for another... 
[05:11:19] (03PS1) 10Marostegui: s5.hosts: db1070 is now the master [software] - 10https://gerrit.wikimedia.org/r/391996 (https://phabricator.wikimedia.org/T180714) [05:11:57] RECOVERY - MariaDB Slave Lag: s5 on db1099 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [05:13:02] (03CR) 10Marostegui: [C: 032] s5.hosts: db1070 is now the master [software] - 10https://gerrit.wikimedia.org/r/391996 (https://phabricator.wikimedia.org/T180714) (owner: 10Marostegui) [05:14:27] (03Merged) 10jenkins-bot: s5.hosts: db1070 is now the master [software] - 10https://gerrit.wikimedia.org/r/391996 (https://phabricator.wikimedia.org/T180714) (owner: 10Marostegui) [05:18:25] PROBLEM - Apache HTTP on mw2133 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:19:15] RECOVERY - Apache HTTP on mw2133 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.115 second response time [05:19:24] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 224, down: 1, dormant: 0, excluded: 0, unused: 0 [05:19:34] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0 [05:21:14] PROBLEM - BGP status on cr1-eqdfw is CRITICAL: BGP CRITICAL - No response from remote host 208.80.153.198 [05:21:24] PROBLEM - Router interfaces on cr1-eqdfw is CRITICAL: CRITICAL: No response from remote host 208.80.153.198 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 [05:23:05] RECOVERY - BGP status on cr1-eqdfw is OK: BGP OK - up: 53, down: 2, shutdown: 2 [05:23:15] RECOVERY - Router interfaces on cr1-eqdfw is OK: OK: host 208.80.153.198, interfaces up: 35, down: 0, dormant: 0, excluded: 0, unused: 0 [05:38:04] PROBLEM - puppet last run on lvs4002 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [05:47:14] PROBLEM - Juniper alarms on cr1-eqdfw is CRITICAL: JNX_ALARMS CRITICAL - No response from remote host 208.80.153.198 [05:48:04] RECOVERY - Juniper alarms on cr1-eqdfw is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms [06:06:15] PROBLEM - Router interfaces on cr1-eqdfw is CRITICAL: CRITICAL: No response from remote host 208.80.153.198 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 [06:07:04] RECOVERY - Router interfaces on cr1-eqdfw is OK: OK: host 208.80.153.198, interfaces up: 35, down: 0, dormant: 0, excluded: 0, unused: 0 [06:08:04] RECOVERY - puppet last run on lvs4002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [06:18:35] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0 [06:19:15] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 226, down: 0, dormant: 0, excluded: 0, unused: 0 [06:31:23] 10Operations, 10DBA, 10Patch-For-Review, 10Wikimedia-Incident: s5 primary master db1063 crashed - https://phabricator.wikimedia.org/T180714#3769028 (10Marostegui) [06:31:55] PROBLEM - puppet last run on mw2168 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. 
Failed resources (up to 3 shown): File[/usr/local/share/ca-certificates/RapidSSL_SHA256_CA_-_G3.crt] [06:32:05] PROBLEM - Misc HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [06:32:54] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [06:34:49] !log Revert schema changes on s5 codfw master with replication enabled, lag will be generated on codfw s5 - T180714 [06:34:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:34:57] T180714: s5 primary master db1063 crashed - https://phabricator.wikimedia.org/T180714 [06:36:24] PROBLEM - Apache HTTP on mw2129 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:37:17] RECOVERY - Apache HTTP on mw2129 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.132 second response time [06:42:55] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [06:43:14] RECOVERY - Misc HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [06:51:24] (03PS3) 10Marostegui: db-eqiad.php: Start repooling all the hosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391995 (https://phabricator.wikimedia.org/T180714) [06:56:54] PROBLEM - BGP status on cr1-eqdfw is CRITICAL: BGP CRITICAL - No response from remote host 208.80.153.198 [06:56:55] RECOVERY - puppet last run on mw2168 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:44] RECOVERY - BGP status on cr1-eqdfw is OK: BGP OK - up: 53, down: 2, shutdown: 2 [07:08:04] (03PS4) 10Marostegui: db-eqiad.php: Start repooling all the hosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391995 (https://phabricator.wikimedia.org/T180714) [07:11:26] (03PS5) 10Marostegui: db-eqiad.php: Start repooling all the hosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391995 
(https://phabricator.wikimedia.org/T180714) [07:14:12] (03CR) 10Jcrespo: [C: 031] db-eqiad.php: Start repooling all the hosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391995 (https://phabricator.wikimedia.org/T180714) (owner: 10Marostegui) [07:14:22] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Start repooling all the hosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391995 (https://phabricator.wikimedia.org/T180714) (owner: 10Marostegui) [07:15:37] (03Merged) 10jenkins-bot: db-eqiad.php: Start repooling all the hosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391995 (https://phabricator.wikimedia.org/T180714) (owner: 10Marostegui) [07:16:28] (03CR) 10jenkins-bot: db-eqiad.php: Start repooling all the hosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391995 (https://phabricator.wikimedia.org/T180714) (owner: 10Marostegui) [07:17:18] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool s5 eqiad hosts after reverting schema change - T180714 (duration: 00m 50s) [07:17:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:17:25] T180714: s5 primary master db1063 crashed - https://phabricator.wikimedia.org/T180714 [07:30:43] (03Abandoned) 10Marostegui: db-eqiad.php: Pool db1100 as vslow [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391873 (https://phabricator.wikimedia.org/T180714) (owner: 10Marostegui) [07:33:19] !log Enable GTID on all the eqiad up-to-date hosts, only pending db1071 - T180714 [07:33:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:33:27] T180714: s5 primary master db1063 crashed - https://phabricator.wikimedia.org/T180714 [07:34:49] 10Operations, 10DBA, 10Patch-For-Review, 10Wikimedia-Incident: s5 primary master db1063 crashed - https://phabricator.wikimedia.org/T180714#3769053 (10Marostegui) p:05Unbreak!>03Normal Setting back priority to normal as we are back to a normal state now. 
Pending things: - move dbstore1001 under the new... [07:39:52] (03PS1) 10Marostegui: db-eqiad.php: Increase weight for some hosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391997 (https://phabricator.wikimedia.org/T180714) [07:50:12] !log Move dbstore1002 under db1070 - T180714 [07:50:14] RECOVERY - MariaDB Slave IO: s5 on dbstore1002 is OK: OK slave_io_state Slave_IO_Running: Yes [07:50:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:50:19] T180714: s5 primary master db1063 crashed - https://phabricator.wikimedia.org/T180714 [07:51:04] !log Revert schema change on dbstore1002 - T180714 [07:51:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:58:36] 10Operations, 10Ops-Access-Requests, 10Patch-For-Review, 10Reading-Infrastructure-Team-Backlog (Kanban): All Reading Infrastructure engineers should have deploy rights for all services Readers engineering maintains - https://phabricator.wikimedia.org/T180366#3769064 (10mobrovac) >>! In T180366#3768201, @Ro... [08:01:09] 10Operations, 10DBA, 10Patch-For-Review, 10Wikimedia-Incident: s5 primary master db1063 crashed - https://phabricator.wikimedia.org/T180714#3769069 (10Marostegui) [08:04:32] !log reboot stat100[456] for kernel updates [08:04:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:17:29] (03PS2) 10Muehlenhoff: statistics::sites::pivot: Restrict access [puppet] - 10https://gerrit.wikimedia.org/r/388510 [08:20:38] (03CR) 10Muehlenhoff: [C: 032] statistics::sites::pivot: Restrict access [puppet] - 10https://gerrit.wikimedia.org/r/388510 (owner: 10Muehlenhoff) [08:25:10] 10Operations, 10Trending-Service, 10Reading-Infrastructure-Team-Backlog (Kanban), 10Services (designing): Turn off Trending Service - https://phabricator.wikimedia.org/T180384#3769076 (10mobrovac) >>! In T180384#3767834, @Jdlrobson wrote: > @Pchelolo asked me a few questions > >> are you up for being a ma... 
[08:26:25] 10Operations, 10DBA, 10Patch-For-Review, 10Wikimedia-Incident: s5 primary master db1063 crashed - https://phabricator.wikimedia.org/T180714#3769079 (10Marostegui) [08:27:08] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Increase weight for some hosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391997 (https://phabricator.wikimedia.org/T180714) (owner: 10Marostegui) [08:28:21] (03Merged) 10jenkins-bot: db-eqiad.php: Increase weight for some hosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391997 (https://phabricator.wikimedia.org/T180714) (owner: 10Marostegui) [08:28:32] (03CR) 10jenkins-bot: db-eqiad.php: Increase weight for some hosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391997 (https://phabricator.wikimedia.org/T180714) (owner: 10Marostegui) [08:30:06] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Increase traffic for the s5 eqiad hosts that were out for hours due to schema change revert after s5 master crash (duration: 00m 50s) [08:30:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:42:37] (03CR) 10Muehlenhoff: [C: 031] "Looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/390402 (https://phabricator.wikimedia.org/T179964) (owner: 10Gehel) [08:43:58] (03CR) 10Muehlenhoff: [C: 031] "Ack, looks all good to me." 
[puppet] - 10https://gerrit.wikimedia.org/r/391797 (https://phabricator.wikimedia.org/T178799) (owner: 10Giuseppe Lavagetto) [08:44:30] 10Operations, 10DBA, 10Patch-For-Review, 10Wikimedia-Incident: s5 primary master db1063 crashed - https://phabricator.wikimedia.org/T180714#3769090 (10Marostegui) db1063 is totally broken and won't start: https://phabricator.wikimedia.org/P6337 [08:45:17] 10Operations, 10DBA, 10Patch-For-Review, 10Wikimedia-Incident: s5 primary master db1063 crashed - https://phabricator.wikimedia.org/T180714#3769091 (10Marostegui) [08:53:17] (03CR) 10Muehlenhoff: "I pinged #wikimedia-releng, if there are no objections I'll merge that on Monday." [puppet] - 10https://gerrit.wikimedia.org/r/390377 (owner: 10Gehel) [08:54:00] moritzm: ^thanks! I'll be on vacation... but ping me in case of need, I **might** be around [08:54:54] !log bootstrap restbase2006-b - T179422 [08:55:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:55:02] T179422: Reshape RESTBase Cassandra clusters - https://phabricator.wikimedia.org/T179422 [08:55:43] (03CR) 10Muehlenhoff: [C: 04-1] "As mentioned on IRC, just adding to Gerrit for completeness, this also needs to cover the stat hosts (which are granted access /etc/rsync." [puppet] - 10https://gerrit.wikimedia.org/r/391824 (https://phabricator.wikimedia.org/T165136) (owner: 10Rush) [08:56:59] (03PS5) 10Volans: Icinga: allow to set display_name [puppet] - 10https://gerrit.wikimedia.org/r/391235 (https://phabricator.wikimedia.org/T170353) [08:57:20] gehel: np, I'll double-check the status of /etc/apt/sources.d/ before merging myself, enjoy your day off [08:59:17] moritzm: _week_ off ! Yeah! 
[09:00:37] even more reason not to disturb :-) [09:01:21] (03CR) 10Volans: [C: 032] Icinga: allow to set display_name [puppet] - 10https://gerrit.wikimedia.org/r/391235 (https://phabricator.wikimedia.org/T170353) (owner: 10Volans) [09:04:50] !log ppchelko@tin Started deploy [cpjobqueue/deploy@cbd25d3]: Bump overall concurrency to get rid of RecordLintJob backlog [09:04:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:05:26] !log ppchelko@tin Finished deploy [cpjobqueue/deploy@cbd25d3]: Bump overall concurrency to get rid of RecordLintJob backlog (duration: 00m 35s) [09:05:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:05:59] (03PS3) 10Volans: Grafana: add graph to Swift dashboard [puppet] - 10https://gerrit.wikimedia.org/r/391525 (https://phabricator.wikimedia.org/T170353) [09:06:03] (03PS1) 10Elukey: profile::pmacct: move configuration to Kafka Jumbo [puppet] - 10https://gerrit.wikimedia.org/r/392007 (https://phabricator.wikimedia.org/T173489) [09:06:27] (03CR) 10jerkins-bot: [V: 04-1] profile::pmacct: move configuration to Kafka Jumbo [puppet] - 10https://gerrit.wikimedia.org/r/392007 (https://phabricator.wikimedia.org/T173489) (owner: 10Elukey) [09:06:46] (03CR) 10Volans: [C: 032] Grafana: add graph to Swift dashboard [puppet] - 10https://gerrit.wikimedia.org/r/391525 (https://phabricator.wikimedia.org/T170353) (owner: 10Volans) [09:07:24] typooosss [09:08:25] (03PS2) 10Elukey: profile::pmacct: move configuration to Kafka Jumbo [puppet] - 10https://gerrit.wikimedia.org/r/392007 (https://phabricator.wikimedia.org/T173489) [09:10:59] !log cleanup leftover titlesuggest indices on elasticsearch eqiad (jawiki, frwiki, ptwiki) [09:11:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:13:05] !log changing master of db1071 (63 -> 70) [09:13:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:17:16] (03CR) 10Volans: "@elukey: done" (031 
comment) [puppet] - 10https://gerrit.wikimedia.org/r/391236 (https://phabricator.wikimedia.org/T170353) (owner: 10Volans) [09:17:32] (03PS6) 10Volans: Metric alarms: add link to the Grafana dashboard [puppet] - 10https://gerrit.wikimedia.org/r/391236 (https://phabricator.wikimedia.org/T170353) [09:17:34] (03PS6) 10Volans: Icinga notification: use display_name in messages [puppet] - 10https://gerrit.wikimedia.org/r/391237 (https://phabricator.wikimedia.org/T170353) [09:17:36] (03PS7) 10Volans: Metric alarms: make link to Grafana mandatory [puppet] - 10https://gerrit.wikimedia.org/r/391238 (https://phabricator.wikimedia.org/T170353) [09:18:25] elukey: updated ^^^ [09:20:14] !log start restbase2006-c instead, restbase2006-b failed and -c shows as "down" - T179422 [09:20:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:20:23] T179422: Reshape RESTBase Cassandra clusters - https://phabricator.wikimedia.org/T179422 [09:23:24] 10Operations, 10DBA, 10Patch-For-Review, 10Wikimedia-Incident: s5 primary master db1063 crashed - https://phabricator.wikimedia.org/T180714#3769141 (10Marostegui) [09:24:32] volans: LGTM :) [09:24:34] (03PS3) 10Elukey: profile::pmacct: move configuration to Kafka Jumbo [puppet] - 10https://gerrit.wikimedia.org/r/392007 (https://phabricator.wikimedia.org/T173489) [09:26:08] thx [09:36:15] (03PS1) 10Muehlenhoff: Drop pmtpa IP address from role::labs::nfs::misc::dump_servers_ips [puppet] - 10https://gerrit.wikimedia.org/r/392014 [09:36:18] (03PS4) 10Elukey: profile::pmacct: move configuration to Kafka Jumbo [puppet] - 10https://gerrit.wikimedia.org/r/392007 (https://phabricator.wikimedia.org/T173489) [09:36:45] 10Operations, 10DBA, 10Patch-For-Review, 10Wikimedia-Incident: s5 primary master db1063 crashed - https://phabricator.wikimedia.org/T180714#3769161 (10Marostegui) [09:38:17] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler02/8832/rhenium.wikimedia.org/" [puppet] - 
10https://gerrit.wikimedia.org/r/392007 (https://phabricator.wikimedia.org/T173489) (owner: 10Elukey) [09:38:28] !log ppchelko@tin Started deploy [cpjobqueue/deploy@2141162]: Bump concurrency even more [09:38:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:38:58] !log ppchelko@tin Finished deploy [cpjobqueue/deploy@2141162]: Bump concurrency even more (duration: 00m 29s) [09:39:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:42:01] PROBLEM - Router interfaces on cr1-eqdfw is CRITICAL: CRITICAL: No response from remote host 208.80.153.198 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 [09:42:51] RECOVERY - Router interfaces on cr1-eqdfw is OK: OK: host 208.80.153.198, interfaces up: 35, down: 0, dormant: 0, excluded: 0, unused: 0 [09:44:50] PROBLEM - DPKG on neodymium is CRITICAL: DPKG CRITICAL dpkg reports broken packages [09:47:50] RECOVERY - DPKG on neodymium is OK: All packages OK [09:49:07] 10Operations, 10Release Pipeline, 10Release-Engineering-Team (Watching / External): Update Debian package for Blubber - https://phabricator.wikimedia.org/T179984#3769169 (10akosiaris) I 've tried building the package once more. It fails as it needs a tag for 00820cbd6bbcc98321c5a0d279394673425d0783 (I am gue... [09:50:27] (03CR) 10Hashar: [C: 031] "I did that for a CI monitoring probe that yells:" [puppet] - 10https://gerrit.wikimedia.org/r/391236 (https://phabricator.wikimedia.org/T170353) (owner: 10Volans) [09:51:29] (03CR) 10Hashar: [C: 031] "If something ends up missing, we can catch up later and properly puppetize the hacked up apt list. 
I guess once puppet has run accross t" [puppet] - 10https://gerrit.wikimedia.org/r/390377 (owner: 10Gehel) [09:53:38] 10Operations, 10monitoring, 10netops, 10Patch-For-Review, 10User-Elukey: pmacct should be upgraded to 1.6.2 on Stretch - https://phabricator.wikimedia.org/T173489#3769192 (10elukey) pmacct 1.7 should not be strictly needed now that we have a new Kafka 0.11 cluster (jumbo)! [09:54:12] !log rebooting labnodepool* for update to 4.9.51 [09:54:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:54:33] 10Operations, 10Cloud-VPS: rack/setup/install labnodepool1002.eqiad.wmnet - https://phabricator.wikimedia.org/T168407#3363615 (10hashar) Nodepool will be phased out eventually over the next 6 months or so. So maybe we can recycle `labnodepool1002.eqiad.wmnet`? I am not sure it is worth the time to setup a spa... [09:55:05] (03PS5) 10Elukey: profile::pmacct: move configuration to Kafka Jumbo [puppet] - 10https://gerrit.wikimedia.org/r/392007 (https://phabricator.wikimedia.org/T173489) [09:56:02] (03PS1) 10Marostegui: db-eqiad.php: Restore original weight for s5 hosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/392017 (https://phabricator.wikimedia.org/T180714) [09:57:35] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Restore original weight for s5 hosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/392017 (https://phabricator.wikimedia.org/T180714) (owner: 10Marostegui) [09:58:51] (03Merged) 10jenkins-bot: db-eqiad.php: Restore original weight for s5 hosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/392017 (https://phabricator.wikimedia.org/T180714) (owner: 10Marostegui) [09:59:00] (03CR) 10jenkins-bot: db-eqiad.php: Restore original weight for s5 hosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/392017 (https://phabricator.wikimedia.org/T180714) (owner: 10Marostegui) [09:59:49] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Restore original traffic for the s5 eqiad hosts that were 
out for hours due to schema change revert after s5 master crash (duration: 00m 48s) [09:59:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:00:50] RECOVERY - MariaDB Slave Lag: s5 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [10:01:45] !log rebooting darmstadtium (docker registry) for update to 4.9.51 [10:01:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:15:11] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 20 probes of 286 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [10:19:49] (03PS20) 10Elukey: First commit [software/druid_exporter] - 10https://gerrit.wikimedia.org/r/389475 (https://phabricator.wikimedia.org/T177459) [10:20:11] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 12 probes of 286 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [10:25:22] (03PS1) 10Jcrespo: wikireplicas: Repoint web and analytics back to the original servers [puppet] - 10https://gerrit.wikimedia.org/r/392020 (https://phabricator.wikimedia.org/T179244) [10:30:50] (03CR) 10Marostegui: [C: 031] wikireplicas: Repoint web and analytics back to the original servers [puppet] - 10https://gerrit.wikimedia.org/r/392020 (https://phabricator.wikimedia.org/T179244) (owner: 10Jcrespo) [10:34:16] (03CR) 10Filippo Giunchedi: [C: 031] First commit [software/druid_exporter] - 10https://gerrit.wikimedia.org/r/389475 (https://phabricator.wikimedia.org/T177459) (owner: 10Elukey) [10:34:36] (03CR) 10Elukey: [V: 032 C: 032] First commit [software/druid_exporter] - 10https://gerrit.wikimedia.org/r/389475 (https://phabricator.wikimedia.org/T177459) (owner: 10Elukey) [10:41:37] 10Operations, 10DBA, 10Patch-For-Review, 10Wikimedia-Incident: s5 primary master db1063 crashed - https://phabricator.wikimedia.org/T180714#3769251 (10Marostegui) I have copied db1063's binlogs over to: ``` root@dbstore1001:/srv/tmp/T180714# 
ls -lh total 21G -rw-r--r-- 1 root root 21G Nov 17 10:39 db1063_b... [10:42:00] 10Operations, 10DBA, 10Patch-For-Review, 10Wikimedia-Incident: s5 primary master db1063 crashed - https://phabricator.wikimedia.org/T180714#3769252 (10Marostegui) [10:47:11] 10Operations, 10ops-codfw: mw2251 hardware error - https://phabricator.wikimedia.org/T180724#3769259 (10akosiaris) 05Open>03Resolved a:03akosiaris I am willing to bet this will show up again. Memory errors don't just go away, no matter what ePSA says. In any case I guess we can close this and reopen it w... [10:47:27] !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: name=mw2251.codfw.wmnet [10:47:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:47:50] !log pool mw2251 T180724 [10:47:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:47:56] T180724: mw2251 hardware error - https://phabricator.wikimedia.org/T180724 [10:48:00] I 'll sync mediawiki-config as well just to be sure [10:48:37] !log akosiaris@tin Started scap: (no justification provided) [10:48:40] !log akosiaris@tin scap aborted: (no justification provided) (duration: 00m 02s) [10:48:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:48:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:49:44] !log sync wmf-config/db-eqiad.php for T180724 [10:49:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:50:00] !log akosiaris@tin Synchronized wmf-config/db-eqiad.php: (no justification provided) (duration: 00m 49s) [10:50:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:52:09] (03PS1) 10Elukey: Initial debianization [software/druid_exporter] (debian) - 10https://gerrit.wikimedia.org/r/392023 [10:53:05] (03CR) 10Jcrespo: [C: 032] wikireplicas: Repoint web and analytics back to the original servers [puppet] - 10https://gerrit.wikimedia.org/r/392020 
(https://phabricator.wikimedia.org/T179244) (owner: 10Jcrespo) [11:01:55] !log create webperf1001, webperf2001 in ganeti T179036 [11:02:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:02:02] T179036: Request VM for webperf (metrics processing) - https://phabricator.wikimedia.org/T179036 [11:04:41] (03CR) 10ArielGlenn: [C: 032] add labstore1006 to list of hosts for rolling rsync of xml/sql dumps [puppet] - 10https://gerrit.wikimedia.org/r/391905 (https://phabricator.wikimedia.org/T171541) (owner: 10ArielGlenn) [11:04:47] (03PS2) 10ArielGlenn: add labstore1006 to list of hosts for rolling rsync of xml/sql dumps [puppet] - 10https://gerrit.wikimedia.org/r/391905 (https://phabricator.wikimedia.org/T171541) [11:04:58] dammit. meant to rebase first [11:05:50] RECOVERY - mediawiki-installation DSH group on mw2251 is OK: OK [11:28:02] (03PS1) 10Jcrespo: mariadb: Remove legacy parameter mariadb10 [puppet] - 10https://gerrit.wikimedia.org/r/392027 [11:30:27] 10Operations, 10vm-requests, 10Patch-For-Review, 10Performance-Team (Radar): Request VM for webperf (metrics processing) - https://phabricator.wikimedia.org/T179036#3769324 (10Dzahn) a:03Dzahn [11:30:50] PROBLEM - DPKG on mwlog2001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [11:31:50] RECOVERY - DPKG on mwlog2001 is OK: All packages OK [11:33:23] !log installing openssl updates on puppetmasters [11:33:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:37:15] (03CR) 10Jcrespo: [C: 032] "https://puppet-compiler.wmflabs.org/compiler02/8833/" [puppet] - 10https://gerrit.wikimedia.org/r/392027 (owner: 10Jcrespo) [11:37:22] (03PS2) 10Jcrespo: mariadb: Remove legacy parameter mariadb10 [puppet] - 10https://gerrit.wikimedia.org/r/392027 [11:41:10] (03PS1) 10Marostegui: db1063.yaml: Clean up after its crash [puppet] - 10https://gerrit.wikimedia.org/r/392029 [11:43:00] !log installing openssl updates on dbproxy hosts [11:43:06] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:43:15] (03CR) 10Jcrespo: [C: 031] db1063.yaml: Clean up after its crash [puppet] - 10https://gerrit.wikimedia.org/r/392029 (owner: 10Marostegui) [11:43:27] (03CR) 10Marostegui: [C: 032] db1063.yaml: Clean up after its crash [puppet] - 10https://gerrit.wikimedia.org/r/392029 (owner: 10Marostegui) [11:44:09] (03PS3) 10ArielGlenn: move remaining hardcoded paths for base dir of misc dumps out to profile [puppet] - 10https://gerrit.wikimedia.org/r/381838 (https://phabricator.wikimedia.org/T175528) [11:46:44] !log installing openssl updates on etcd* hosts [11:46:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:50:24] 10Operations, 10vm-requests, 10Patch-For-Review, 10Performance-Team (Radar): Request VM for webperf (metrics processing) - https://phabricator.wikimedia.org/T179036#3769354 (10Dzahn) Can we use stretch? I'll assume stretch unless there are reasons not to. [11:51:27] (03PS1) 10Dzahn: add webperf1001/2001 to site, using webperf role [puppet] - 10https://gerrit.wikimedia.org/r/392030 (https://phabricator.wikimedia.org/T179036) [11:57:45] (03PS1) 10Dzahn: install_server/DHCP: add webperf1001/2001 [puppet] - 10https://gerrit.wikimedia.org/r/392031 (https://phabricator.wikimedia.org/T179036) [11:58:27] (03CR) 10Dzahn: "flat.cfg ok? stretch ok?" 
[puppet] - 10https://gerrit.wikimedia.org/r/392031 (https://phabricator.wikimedia.org/T179036) (owner: 10Dzahn) [12:01:48] (03PS1) 10Alexandros Kosiaris: Introduce webperf1001, webperf2001 [puppet] - 10https://gerrit.wikimedia.org/r/392035 (https://phabricator.wikimedia.org/T179036) [12:02:16] (03PS1) 10Phedenskog: webperf: Add missing mediaWikiLoad to navtiming2 [puppet] - 10https://gerrit.wikimedia.org/r/392036 (https://phabricator.wikimedia.org/T180598) [12:02:33] (03CR) 10jerkins-bot: [V: 04-1] Introduce webperf1001, webperf2001 [puppet] - 10https://gerrit.wikimedia.org/r/392035 (https://phabricator.wikimedia.org/T179036) (owner: 10Alexandros Kosiaris) [12:03:18] (03CR) 10Dzahn: "netboot.cfg / partman, using flat.cfg? also https://gerrit.wikimedia.org/r/#/c/392031/1 heh" [puppet] - 10https://gerrit.wikimedia.org/r/392035 (https://phabricator.wikimedia.org/T179036) (owner: 10Alexandros Kosiaris) [12:03:51] ah :) [12:07:08] !log installing openssl updates on graphite* hosts [12:07:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:09:31] (03CR) 10Hashar: sample uwsgi app that would produce json status output for dumps (033 comments) [dumps/statusapi] - 10https://gerrit.wikimedia.org/r/335007 (https://phabricator.wikimedia.org/T147177) (owner: 10ArielGlenn) [12:09:45] akosiaris: i started those too, duplicates now :) take any [12:09:56] or i can [12:10:15] (03PS1) 10Hashar: Add tox and pass flake8 [dumps/statusapi] - 10https://gerrit.wikimedia.org/r/392037 (https://phabricator.wikimedia.org/T180328) [12:10:16] mutante: ok I am backing up then. 
it's all yours [12:10:44] (03CR) 10Hashar: "That should probably be added to the initial change: https://gerrit.wikimedia.org/r/#/c/335007/" [dumps/statusapi] - 10https://gerrit.wikimedia.org/r/392037 (https://phabricator.wikimedia.org/T180328) (owner: 10Hashar) [12:11:09] (03Abandoned) 10Alexandros Kosiaris: Introduce webperf1001, webperf2001 [puppet] - 10https://gerrit.wikimedia.org/r/392035 (https://phabricator.wikimedia.org/T179036) (owner: 10Alexandros Kosiaris) [12:11:11] akosiaris: ok! i did assume we can use stretch, will see [12:11:33] i mean, i can check if the webperf role works on that [12:12:33] (03CR) 10Alexandros Kosiaris: [C: 04-1] "flat.cfg is definetely ok, not sure about stretch but let's go for it and we 'll see" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/392031 (https://phabricator.wikimedia.org/T179036) (owner: 10Dzahn) [12:13:15] duh :) ok [12:13:32] (03CR) 10Deskana: [C: 04-1] "You also need to add the time the variable was switched on in wmgVisualEditorSingleEditTabSwitchTime, which is immediately underneath wmgV" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391756 (https://phabricator.wikimedia.org/T180660) (owner: 10Jayprakash12345) [12:13:47] (03PS2) 10Dzahn: install_server/DHCP: add webperf1001/2001 [puppet] - 10https://gerrit.wikimedia.org/r/392031 (https://phabricator.wikimedia.org/T179036) [12:14:15] (03CR) 10Alexandros Kosiaris: [C: 031] install_server/DHCP: add webperf1001/2001 [puppet] - 10https://gerrit.wikimedia.org/r/392031 (https://phabricator.wikimedia.org/T179036) (owner: 10Dzahn) [12:14:54] (03CR) 10Dzahn: [C: 032] install_server/DHCP: add webperf1001/2001 [puppet] - 10https://gerrit.wikimedia.org/r/392031 (https://phabricator.wikimedia.org/T179036) (owner: 10Dzahn) [12:15:36] !log installing openssl updates on poolcounters [12:15:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:26:38] 10Operations, 10monitoring, 10Graphite, 10Performance-Team (Radar): Upgrade to 
latest Grafana 4.6 - https://phabricator.wikimedia.org/T180428#3769430 (10Peter) Thanks @fgiunchedi. We will not start with the WebPageTest task yet before we have a solution that we all agree on. [12:32:17] hmm.. it feels like this happened to me before.. installing OS on new ganeti VM.. sits there after Loading debian-installer/amd64/initrd.gz...ok but feels like it just stops [12:35:41] PROBLEM - Router interfaces on cr1-eqdfw is CRITICAL: CRITICAL: No response from remote host 208.80.153.198 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 [12:37:31] RECOVERY - Router interfaces on cr1-eqdfw is OK: OK: host 208.80.153.198, interfaces up: 35, down: 0, dormant: 0, excluded: 0, unused: 0 [12:43:21] (03CR) 10Jayprakash12345: "Sir can you tell me the value. Is this 20171118000000." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391756 (https://phabricator.wikimedia.org/T180660) (owner: 10Jayprakash12345) [12:46:25] 10Operations, 10Release Pipeline, 10Release-Engineering-Team (Kanban): Upgrade latest docker-registry.wikimedia.org/nodejs-devel to stretch - https://phabricator.wikimedia.org/T180524#3769460 (10MoritzMuehlenhoff) Yeah, I guess that would be an alternative to consider.
[12:47:50] PROBLEM - Router interfaces on cr1-eqdfw is CRITICAL: CRITICAL: No response from remote host 208.80.153.198 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 [12:49:41] RECOVERY - Router interfaces on cr1-eqdfw is OK: OK: host 208.80.153.198, interfaces up: 35, down: 0, dormant: 0, excluded: 0, unused: 0 [12:56:58] (03PS9) 10BBlack: Fully normalize upload paths [puppet] - 10https://gerrit.wikimedia.org/r/391216 (https://phabricator.wikimedia.org/T171881) [12:59:25] (03PS2) 10BBlack: ssl_ciphersuite: dump 3DES on 2017-11-17 [puppet] - 10https://gerrit.wikimedia.org/r/384578 (https://phabricator.wikimedia.org/T147199) [13:01:51] (03PS2) 10Elukey: Initial debianization [software/druid_exporter] (debian) - 10https://gerrit.wikimedia.org/r/392023 [13:02:41] 10Operations, 10DBA: Investigate dropping "edit_page_tracking" database table from Wikimedia wikis after archiving it - https://phabricator.wikimedia.org/T57385#3769492 (10Marostegui) [13:04:02] 10Operations, 10DBA: Investigate dropping "edit_page_tracking" database table from Wikimedia wikis after archiving it - https://phabricator.wikimedia.org/T57385#3769495 (10Marostegui) @MZMcBride I assume this table is to be dropped, right? So I can update its entry on T57385 "Removable" row saying "YES" ? 
[13:04:54] (03PS3) 10BBlack: ssl_ciphersuite: dump 3DES on 2017-11-17 [puppet] - 10https://gerrit.wikimedia.org/r/384578 (https://phabricator.wikimedia.org/T147199) [13:05:56] (03CR) 10Muehlenhoff: Initial debianization (032 comments) [software/druid_exporter] (debian) - 10https://gerrit.wikimedia.org/r/392023 (owner: 10Elukey) [13:06:02] (03CR) 10BBlack: [C: 032] ssl_ciphersuite: dump 3DES on 2017-11-17 [puppet] - 10https://gerrit.wikimedia.org/r/384578 (https://phabricator.wikimedia.org/T147199) (owner: 10BBlack) [13:07:07] (03PS1) 10Cmjohnson: adding dhcpd entries for db1109-10 T180700 [puppet] - 10https://gerrit.wikimedia.org/r/392038 [13:08:26] (03PS2) 10Cmjohnson: adding dhcpd entries for db1109-10 T180700 [puppet] - 10https://gerrit.wikimedia.org/r/392038 [13:08:59] (03CR) 10Cmjohnson: [C: 032] adding dhcpd entries for db1109-10 T180700 [puppet] - 10https://gerrit.wikimedia.org/r/392038 (owner: 10Cmjohnson) [13:10:31] PROBLEM - puppet last run on mw2174 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): File[/usr/local/bin/furl] [13:11:01] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: Rack and setup db1109 and db1110 - https://phabricator.wikimedia.org/T180700#3769508 (10Cmjohnson) [13:15:12] (03PS1) 10Filippo Giunchedi: prometheus: add mtail/exim jobs [puppet] - 10https://gerrit.wikimedia.org/r/392039 (https://phabricator.wikimedia.org/T179565) [13:24:51] PROBLEM - BGP status on cr1-eqdfw is CRITICAL: BGP CRITICAL - ASunknown/IPv6: Active, ASunknown/IPv4: Connect [13:25:30] mmmmm [13:25:40] RECOVERY - BGP status on cr1-eqdfw is OK: BGP OK - up: 53, down: 2, shutdown: 2 [13:25:42] this has been flapping the whole day --^ [13:26:19] but there is work in equinix chicago atm [13:26:26] that explains it probably :D [13:27:32] (03CR) 10Faidon Liambotis: [C: 04-1] [WIP] Have every rdns advertise a private anycast VIP (037 comments) [puppet] - 10https://gerrit.wikimedia.org/r/391149 (owner: 10Ayounsi) [13:29:23] (03CR) 10Elukey: Initial debianization (032 comments) [software/druid_exporter] (debian) - 10https://gerrit.wikimedia.org/r/392023 (owner: 10Elukey) [13:30:27] (03PS3) 10Elukey: Initial debianization [software/druid_exporter] (debian) - 10https://gerrit.wikimedia.org/r/392023 [13:32:08] (03CR) 10BBlack: [WIP] Have every rdns advertise a private anycast VIP (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/391149 (owner: 10Ayounsi) [13:34:22] Is there something up with interwikis? [13:34:58] (03CR) 10BBlack: [WIP] Have every rdns advertise a private anycast VIP (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/391149 (owner: 10Ayounsi) [13:40:31] RECOVERY - puppet last run on mw2174 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [13:44:06] 10Operations, 10DC-Ops: Information missing from racktables - https://phabricator.wikimedia.org/T150651#3769553 (10Cmjohnson) @faidon db1018 and db1022 are confirmed in racktables but are both decommissioned and removed from the rack. 
I updated the following asset tags with the purchase and warranty expirat... [13:55:17] !log uploaded boost 1.55.0+dfsg-3+wmf1+icu57 to apt.wikimedia.org for jessie-wikimedia/component/icu57 (needed for HHVM build linked against ICU 57) [13:55:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:48] (03PS4) 10Elukey: Initial debianization [software/druid_exporter] (debian) - 10https://gerrit.wikimedia.org/r/392023 [13:59:15] (03PS2) 10Rush: Drop pmtpa IP address from role::labs::nfs::misc::dump_servers_ips [puppet] - 10https://gerrit.wikimedia.org/r/392014 (owner: 10Muehlenhoff) [13:59:28] chasemp: haha [13:59:34] or moritzm I guess [13:59:51] (03CR) 10Rush: [C: 032] Drop pmtpa IP address from role::labs::nfs::misc::dump_servers_ips [puppet] - 10https://gerrit.wikimedia.org/r/392014 (owner: 10Muehlenhoff) [14:00:11] pmtpa references still kicking around :) [14:01:20] (03Abandoned) 10Rush: Use maniphest.edit in phab_epipe.py [puppet] - 10https://gerrit.wikimedia.org/r/357354 (https://phabricator.wikimedia.org/T159043) (owner: 1020after4) [14:02:19] (03PS1) 10Elukey: Initial debianization [software/druid_exporter] - 10https://gerrit.wikimedia.org/r/392043 [14:04:32] !log labstore1003 service nfs-kernel-server restart && service rsync start [14:04:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:18] (03Abandoned) 10Elukey: Initial debianization [software/druid_exporter] - 10https://gerrit.wikimedia.org/r/392043 (owner: 10Elukey) [14:06:26] (03PS5) 10Elukey: Initial debianization [software/druid_exporter] (debian) - 10https://gerrit.wikimedia.org/r/392023 [14:06:56] !log deploy master events to db1070 [14:07:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:12:11] (03PS7) 10Rush: labstore: rsync server on misc (dumps hosting) [puppet] - 10https://gerrit.wikimedia.org/r/391824 (https://phabricator.wikimedia.org/T165136) [14:12:36] (03CR) 10jerkins-bot: [V: 04-1] 
labstore: rsync server on misc (dumps hosting) [puppet] - 10https://gerrit.wikimedia.org/r/391824 (https://phabricator.wikimedia.org/T165136) (owner: 10Rush) [14:13:20] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: Rack and setup db1109 and db1110 - https://phabricator.wikimedia.org/T180700#3769614 (10Cmjohnson) [14:14:08] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: Rack and setup db1109 and db1110 - https://phabricator.wikimedia.org/T180700#3766852 (10Cmjohnson) a:03Marostegui @marostegui These are ready for you [14:14:32] (03PS6) 10Elukey: Initial debianization [software/druid_exporter] (debian) - 10https://gerrit.wikimedia.org/r/392023 [14:15:26] (03CR) 10Elukey: "elukey@boron:~/druid_exporter$ WIKIMEDIA=yes ARCH=amd64 DIST=stretch gbp buildpackage --git-debian-branch=debian -us -uc --git-builder=git" [software/druid_exporter] (debian) - 10https://gerrit.wikimedia.org/r/392023 (owner: 10Elukey) [14:15:46] PROBLEM - puppet last run on mw2235 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/apache2/conf-available/00-defaults.conf] [14:19:55] PROBLEM - puppet last run on mw2237 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:20:36] PROBLEM - puppet last run on mw2216 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. 
Failed resources (up to 3 shown): File[/etc/ganglia/conf.d/hhvm_mem.pyconf] [14:21:35] (03PS3) 10Rush: phab: remove obsolete portions of email handler [puppet] - 10https://gerrit.wikimedia.org/r/391969 [14:22:47] 10Operations, 10ops-eqiad, 10DBA: Rack and setup db1111 and db1112 - https://phabricator.wikimedia.org/T180788#3769649 (10Marostegui) [14:22:48] (03PS1) 10Jcrespo: haproxy: Migrate dbproxy1004 to stretch (haproxy 1.7) format [puppet] - 10https://gerrit.wikimedia.org/r/392044 (https://phabricator.wikimedia.org/T156844) [14:23:08] (03PS1) 10Marostegui: install_server: Allow install db1111 and db1112 [puppet] - 10https://gerrit.wikimedia.org/r/392045 (https://phabricator.wikimedia.org/T180788) [14:23:52] (03CR) 10Marostegui: [C: 032] install_server: Allow install db1111 and db1112 [puppet] - 10https://gerrit.wikimedia.org/r/392045 (https://phabricator.wikimedia.org/T180788) (owner: 10Marostegui) [14:25:42] (03PS8) 10Rush: labstore: rsync server on misc (dumps hosting) [puppet] - 10https://gerrit.wikimedia.org/r/391824 (https://phabricator.wikimedia.org/T165136) [14:25:54] 10Operations, 10DC-Ops: Information missing from racktables - https://phabricator.wikimedia.org/T150651#3769660 (10faidon) >>! In T150651#3769553, @Cmjohnson wrote: > @faidon > db1018 and db1022 are confirmed in racktables but are both decommissioned and removed from the rack. > > I updated the following as... 
[14:27:06] (03PS4) 10Rush: phab: remove obsolete portions of email handler [puppet] - 10https://gerrit.wikimedia.org/r/391969 [14:28:26] PROBLEM - HHVM rendering on mw2202 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:29:25] RECOVERY - HHVM rendering on mw2202 is OK: HTTP OK: HTTP/1.1 200 OK - 79220 bytes in 0.303 second response time [14:33:35] (03PS1) 10Alexandros Kosiaris: Revert "puppet: point codfw mw systems at puppet 4 master puppetmaster2001" [puppet] - 10https://gerrit.wikimedia.org/r/392047 [14:34:03] (03CR) 10jerkins-bot: [V: 04-1] Revert "puppet: point codfw mw systems at puppet 4 master puppetmaster2001" [puppet] - 10https://gerrit.wikimedia.org/r/392047 (owner: 10Alexandros Kosiaris) [14:37:55] (03PS2) 10Dzahn: keyholder: Use systemd::tmpfile [puppet] - 10https://gerrit.wikimedia.org/r/386621 (owner: 10Muehlenhoff) [14:39:23] (03CR) 10Dzahn: "will it still be possible to create tickets via mail to task@ after this?" [puppet] - 10https://gerrit.wikimedia.org/r/391969 (owner: 10Rush) [14:40:22] (03PS2) 10Alexandros Kosiaris: Revert "puppet: point codfw mw systems at puppet 4 master puppetmaster2001" [puppet] - 10https://gerrit.wikimedia.org/r/392047 (https://phabricator.wikimedia.org/T177254) [14:40:50] (03CR) 10Dzahn: [C: 032] keyholder: Use systemd::tmpfile [puppet] - 10https://gerrit.wikimedia.org/r/386621 (owner: 10Muehlenhoff) [14:41:08] (03PS3) 10Alexandros Kosiaris: Revert "puppet: point codfw mw systems at puppet 4 master puppetmaster2001" [puppet] - 10https://gerrit.wikimedia.org/r/392047 (https://phabricator.wikimedia.org/T177254) [14:41:12] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Revert "puppet: point codfw mw systems at puppet 4 master puppetmaster2001" [puppet] - 10https://gerrit.wikimedia.org/r/392047 (https://phabricator.wikimedia.org/T177254) (owner: 10Alexandros Kosiaris) [14:41:24] (03PS2) 10Jcrespo: haproxy: Migrate dbproxy1004 to stretch (haproxy 1.7) format [puppet] - 10https://gerrit.wikimedia.org/r/392044 
(https://phabricator.wikimedia.org/T156844) [14:43:25] (03PS7) 10Elukey: Initial debianization [software/druid_exporter] (debian) - 10https://gerrit.wikimedia.org/r/392023 [14:43:58] (03CR) 10Dzahn: "no-op confirmed on tin/naos/sarin..." [puppet] - 10https://gerrit.wikimedia.org/r/386621 (owner: 10Muehlenhoff) [14:45:51] RECOVERY - puppet last run on mw2235 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [14:45:51] PROBLEM - cassandra-b service on restbase2006 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed [14:45:51] PROBLEM - cassandra-c service on restbase2006 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed [14:45:51] PROBLEM - cassandra-c CQL 10.192.48.51:9042 on restbase2006 is CRITICAL: connect to address 10.192.48.51 and port 9042: Connection refused [14:46:06] (03CR) 10Elukey: "There was a problem with the absence of the init file, namely some init.d related code added to post-inst. Updated rules and now this is t" [software/druid_exporter] (debian) - 10https://gerrit.wikimedia.org/r/392023 (owner: 10Elukey) [14:46:11] PROBLEM - cassandra-c SSL 10.192.48.51:7001 on restbase2006 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [14:46:21] PROBLEM - Check systemd state on restbase2006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[14:46:22] PROBLEM - cassandra-b CQL 10.192.48.50:9042 on restbase2006 is CRITICAL: connect to address 10.192.48.50 and port 9042: Connection refused [14:46:27] restbase2006 is bootstrapping, will downtime [14:46:31] PROBLEM - cassandra-b SSL 10.192.48.50:7001 on restbase2006 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [14:48:12] (03CR) 10Dzahn: "also see https://gerrit.wikimedia.org/r/#/c/353599/" [puppet] - 10https://gerrit.wikimedia.org/r/386752 (owner: 10RobH) [14:48:54] (03Abandoned) 10Muehlenhoff: Remove stretch-wikimedia/backports [puppet] - 10https://gerrit.wikimedia.org/r/383375 (https://phabricator.wikimedia.org/T158583) (owner: 10Muehlenhoff) [14:49:13] 10Operations, 10Analytics, 10DBA, 10Patch-For-Review, 10User-Elukey: Prep to decommission old dbstore hosts (db1046, db1047) - https://phabricator.wikimedia.org/T156844#3769759 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['dbproxy1004.eqiad.wmnet'] ``` and were **ALL** successful. 
[14:49:52] RECOVERY - puppet last run on mw2237 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [14:50:41] RECOVERY - puppet last run on mw2216 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [14:53:48] (03PS1) 10BBlack: Revert "tlsproxy: drop ssl_session_timeout to 4h" [puppet] - 10https://gerrit.wikimedia.org/r/392050 [14:54:22] PROBLEM - Nginx local proxy to apache on mw2205 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:54:26] (03CR) 10BBlack: [C: 032] Revert "tlsproxy: drop ssl_session_timeout to 4h" [puppet] - 10https://gerrit.wikimedia.org/r/392050 (owner: 10BBlack) [14:55:12] RECOVERY - Nginx local proxy to apache on mw2205 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.193 second response time [14:57:14] (03CR) 10Dzahn: monitor dataset hosts for nfsd cpu usage (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/363548 (owner: 10ArielGlenn) [14:59:16] (03PS2) 10Dzahn: tools: Add paws nodes to clush [puppet] - 10https://gerrit.wikimedia.org/r/360397 (https://phabricator.wikimedia.org/T167086) (owner: 10Yuvipanda) [14:59:18] godog: is that just expired downtime re cass on 2006 ^ ? 
[14:59:26] (03PS3) 10Dzahn: monitor dataset hosts for nfsd cpu usage [puppet] - 10https://gerrit.wikimedia.org/r/363548 (https://phabricator.wikimedia.org/T169680) (owner: 10ArielGlenn) [14:59:52] (03PS4) 10Dzahn: datasets: monitor hosts for nfsd cpu usage [puppet] - 10https://gerrit.wikimedia.org/r/363548 (https://phabricator.wikimedia.org/T169680) (owner: 10ArielGlenn) [14:59:57] mobrovac: yeah, I renewed the downtime [15:00:02] (03PS5) 10Dzahn: datasets: monitor hosts for nfsd cpu usage [puppet] - 10https://gerrit.wikimedia.org/r/363548 (https://phabricator.wikimedia.org/T169680) (owner: 10ArielGlenn) [15:00:09] kk thnx godog [15:00:56] (03PS6) 10Dzahn: datasets: monitor hosts for nfsd cpu usage [puppet] - 10https://gerrit.wikimedia.org/r/363548 (owner: 10ArielGlenn) [15:01:00] I completely forgot about that patchset... I think I had the vague impression that others had done it better in a different commit [15:01:11] (03CR) 10jerkins-bot: [V: 04-1] datasets: monitor hosts for nfsd cpu usage [puppet] - 10https://gerrit.wikimedia.org/r/363548 (owner: 10ArielGlenn) [15:02:08] (03CR) 10Rush: "task@ is a native phab function now which did not fully exist when all this was created :)" [puppet] - 10https://gerrit.wikimedia.org/r/391969 (owner: 10Rush) [15:03:08] (03CR) 10Dzahn: datasets: monitor hosts for nfsd cpu usage (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/363548 (owner: 10ArielGlenn) [15:03:39] (03CR) 10Dzahn: "cool, that's what i wanted to confirm :) thx" [puppet] - 10https://gerrit.wikimedia.org/r/391969 (owner: 10Rush) [15:04:35] (03PS1) 10Elukey: role::druid::*: add configuration for the Prometheus Druid exporter [puppet] - 10https://gerrit.wikimedia.org/r/392052 (https://phabricator.wikimedia.org/T177459) [15:04:41] (03CR) 10Dzahn: [C: 032] tools: Add paws nodes to clush [puppet] - 10https://gerrit.wikimedia.org/r/360397 (https://phabricator.wikimedia.org/T167086) (owner: 10Yuvipanda) [15:05:08] (03CR) 10jerkins-bot: [V: 04-1] role::druid::*: 
add configuration for the Prometheus Druid exporter [puppet] - 10https://gerrit.wikimedia.org/r/392052 (https://phabricator.wikimedia.org/T177459) (owner: 10Elukey) [15:05:51] 10Operations, 10Traffic, 10Patch-For-Review: Planning for phasing out non-Forward-Secret TLS ciphers - https://phabricator.wikimedia.org/T118181#3769819 (10BBlack) [15:05:54] 10Operations, 10Traffic, 10Browser-Support-Internet-Explorer, 10Patch-For-Review, 10User-notice: Removing support for DES-CBC3-SHA TLS cipher (drops IE8-on-XP support) - https://phabricator.wikimedia.org/T147199#3769817 (10BBlack) 05Open>03Resolved No 3DES connections or saved sessions left on the pu... [15:06:21] (03PS7) 10Dzahn: datasets: monitor hosts for nfsd cpu usage [puppet] - 10https://gerrit.wikimedia.org/r/363548 (https://phabricator.wikimedia.org/T169680) (owner: 10ArielGlenn) [15:06:30] 10Operations, 10Traffic: Remove 3DES patch from OpenSSL builds - https://phabricator.wikimedia.org/T180792#3769825 (10BBlack) [15:06:32] (03CR) 10Dzahn: "fixing commit message / bug link" [puppet] - 10https://gerrit.wikimedia.org/r/363548 (https://phabricator.wikimedia.org/T169680) (owner: 10ArielGlenn) [15:07:30] (03PS2) 10Elukey: role::druid::*: add configuration for the Prometheus Druid exporter [puppet] - 10https://gerrit.wikimedia.org/r/392052 (https://phabricator.wikimedia.org/T177459) [15:09:40] (03CR) 10Dzahn: "i would recommend for nagios plugins in general to always end with an "exit 3 / UNKNOWN" and do the "exit 0 / OK" explicitly above if a co" [puppet] - 10https://gerrit.wikimedia.org/r/363548 (https://phabricator.wikimedia.org/T169680) (owner: 10ArielGlenn) [15:10:21] (03CR) 10Jcrespo: [C: 032] haproxy: Migrate dbproxy1004 to stretch (haproxy 1.7) format [puppet] - 10https://gerrit.wikimedia.org/r/392044 (https://phabricator.wikimedia.org/T156844) (owner: 10Jcrespo) [15:10:27] (03PS3) 10Jcrespo: haproxy: Migrate dbproxy1004 to stretch (haproxy 1.7) format [puppet] - 10https://gerrit.wikimedia.org/r/392044 
(https://phabricator.wikimedia.org/T156844) [15:10:36] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler02/8835/" [puppet] - 10https://gerrit.wikimedia.org/r/392052 (https://phabricator.wikimedia.org/T177459) (owner: 10Elukey) [15:15:24] (03CR) 10Filippo Giunchedi: [C: 031] Initial debianization [software/druid_exporter] (debian) - 10https://gerrit.wikimedia.org/r/392023 (owner: 10Elukey) [15:26:32] !log Starting restbase2006-c w/ -Dcassandra.replace_address=10.192.48.51 (T179422) [15:26:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:26:40] T179422: Reshape RESTBase Cassandra clusters - https://phabricator.wikimedia.org/T179422 [15:27:15] RECOVERY - cassandra-c service on restbase2006 is OK: OK - cassandra-c is active [15:27:35] RECOVERY - cassandra-c SSL 10.192.48.51:7001 on restbase2006 is OK: SSL OK - Certificate restbase2006-c valid until 2018-08-17 16:12:05 +0000 (expires in 273 days) [15:27:46] (03PS4) 10ArielGlenn: move remaining hardcoded paths for base dir of misc dumps out to profile [puppet] - 10https://gerrit.wikimedia.org/r/381838 (https://phabricator.wikimedia.org/T175528) [15:30:42] (03PS2) 10Rush: Remove allowances for IPs that are no longer in use [puppet] - 10https://gerrit.wikimedia.org/r/374665 (owner: 10Alex Monk) [15:32:13] (03CR) 10Rush: [C: 032] Remove allowances for IPs that are no longer in use [puppet] - 10https://gerrit.wikimedia.org/r/374665 (owner: 10Alex Monk) [15:33:05] (03PS1) 10Muehlenhoff: Remove enable-weak-ssl-ciphers again, we temporarily enabled it to allow 3des (and that is now disabled) (Bug: T180792) [debs/openssl11] - 10https://gerrit.wikimedia.org/r/392056 (https://phabricator.wikimedia.org/T180792) [15:33:39] (03CR) 10Elukey: [V: 032 C: 032] Initial debianization [software/druid_exporter] (debian) - 10https://gerrit.wikimedia.org/r/392023 (owner: 10Elukey) [15:34:43] (03CR) 10Mforns: "LGTM! 
But I have one question, see comment :]" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/391828 (https://phabricator.wikimedia.org/T156933) (owner: 10Elukey) [15:36:03] (03PS9) 10Rush: labstore: rsync server on misc (dumps hosting) [puppet] - 10https://gerrit.wikimedia.org/r/391824 (https://phabricator.wikimedia.org/T165136) [15:36:47] (03CR) 10Rush: [C: 032] labstore: rsync server on misc (dumps hosting) [puppet] - 10https://gerrit.wikimedia.org/r/391824 (https://phabricator.wikimedia.org/T165136) (owner: 10Rush) [15:40:18] (03CR) 10Elukey: profile::mariadb::misc::eventlogging:replication: add EL sanitization cron (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/391828 (https://phabricator.wikimedia.org/T156933) (owner: 10Elukey) [15:41:33] (03CR) 10Mforns: [C: 031] "LGTM!!" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/391828 (https://phabricator.wikimedia.org/T156933) (owner: 10Elukey) [15:43:49] (03PS1) 10DCausse: [cirrus] disable token count router [mediawiki-config] - 10https://gerrit.wikimedia.org/r/392057 (https://phabricator.wikimedia.org/T180795) [15:43:51] (03PS5) 10ArielGlenn: move remaining hardcoded paths for base dir of misc dumps out to profile [puppet] - 10https://gerrit.wikimedia.org/r/381838 (https://phabricator.wikimedia.org/T175528) [15:43:57] !log labservices1001:~# mv /var/zones/tools.eqiad.wmflabs /home/rush T180797 [15:44:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:44:04] T180797: labservices1001 manual zone file /var/zones/tools.eqiad.wmflabs - https://phabricator.wikimedia.org/T180797 [15:44:13] (03CR) 10jerkins-bot: [V: 04-1] move remaining hardcoded paths for base dir of misc dumps out to profile [puppet] - 10https://gerrit.wikimedia.org/r/381838 (https://phabricator.wikimedia.org/T175528) (owner: 10ArielGlenn) [15:49:06] 10Operations, 10Traffic, 10Browser-Support-Internet-Explorer, 10Patch-For-Review, 10User-notice: Removing support for DES-CBC3-SHA TLS cipher (drops 
IE8-on-XP support) - https://phabricator.wikimedia.org/T147199#3769991 (10BBlack) [15:49:10] 10Operations, 10Traffic, 10Community-Liaisons (Oct-Dec 2017), 10Patch-For-Review, 10User-Johan: Communicate dropping IE8-on-XP support (a security change) to affected editors and other community members - https://phabricator.wikimedia.org/T163251#3769989 (10BBlack) 05Open>03Resolved Done here I think... [15:51:52] (03CR) 10Hashar: [C: 031] [cirrus] disable token count router [mediawiki-config] - 10https://gerrit.wikimedia.org/r/392057 (https://phabricator.wikimedia.org/T180795) (owner: 10DCausse) [15:56:28] fyi: I'm deploying https://gerrit.wikimedia.org/r/#/c/392057/ from tin [15:56:35] (03CR) 10DCausse: [C: 032] [cirrus] disable token count router [mediawiki-config] - 10https://gerrit.wikimedia.org/r/392057 (https://phabricator.wikimedia.org/T180795) (owner: 10DCausse) [15:57:38] (03CR) 10Zoranzoki21: [C: 031] "> You should schedule this for SWAT if you want this merged." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370791 (https://phabricator.wikimedia.org/T101983) (owner: 10TerraCodes) [15:57:50] (03CR) 10Zoranzoki21: [C: 031] [cirrus] disable token count router [mediawiki-config] - 10https://gerrit.wikimedia.org/r/392057 (https://phabricator.wikimedia.org/T180795) (owner: 10DCausse) [15:58:17] (03PS12) 10Zoranzoki21: Enable the ArticlePlaceholder for sewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/387077 (https://phabricator.wikimedia.org/T179241) [15:59:17] (03Merged) 10jenkins-bot: [cirrus] disable token count router [mediawiki-config] - 10https://gerrit.wikimedia.org/r/392057 (https://phabricator.wikimedia.org/T180795) (owner: 10DCausse) [15:59:26] (03CR) 10jenkins-bot: [cirrus] disable token count router [mediawiki-config] - 10https://gerrit.wikimedia.org/r/392057 (https://phabricator.wikimedia.org/T180795) (owner: 10DCausse) [16:04:51] !log dcausse@tin Synchronized wmf-config/CirrusSearch-common.php: T180795 [cirrus] disable token count 
router (duration: 00m 49s) [16:04:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:04:57] T180795: Elastic 5.5 rolling restart causes some search queries to fail - https://phabricator.wikimedia.org/T180795 [16:05:03] ;D [16:05:46] PROBLEM - BGP status on cr1-eqdfw is CRITICAL: Use of uninitialized value duration in numeric gt () at /usr/lib/nagios/plugins/check_bgp line 316. [16:05:49] (03PS6) 10ArielGlenn: move remaining hardcoded paths for base dir of misc dumps out to profile [puppet] - 10https://gerrit.wikimedia.org/r/381838 (https://phabricator.wikimedia.org/T175528) [16:06:05] RECOVERY - BGP status on cr1-eqdfw is OK: BGP OK - up: 53, down: 2, shutdown: 2 [16:06:06] XioNoX: FYI ^^^ [16:06:56] volans: thx, seems like snmp timeout [16:07:12] let me check those two down [16:08:20] XioNoX: from my logs it flapped 15 times the alarm for cr1-eqdfw starting at 23:15 UTC tonight [16:08:49] volans: that specific alert? [16:08:49] volans: there were multiple maintenance windows announced, not sure if related or not [16:08:58] (Telia + Equinix Chicago) [16:09:14] (03PS10) 10BBlack: Fully normalize upload paths [puppet] - 10https://gerrit.wikimedia.org/r/391216 (https://phabricator.wikimedia.org/T171881) [16:09:48] 1 Juniper alarms, 5 BGP status, the rest Router interfaces [16:09:58] seems like it [16:10:01] I'll dig more, thx [16:10:49] thank you! [16:10:52] (03CR) 10Zoranzoki21: [C: 031] "> > You should schedule this for SWAT if you want this merged." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/370791 (https://phabricator.wikimedia.org/T101983) (owner: 10TerraCodes) [16:13:30] (03CR) 10BBlack: [C: 031] "I've tested this extensively by extracting the meat of it as a CLI program (replacing the Varnish-specific bits for getting/setting the UR" [puppet] - 10https://gerrit.wikimedia.org/r/391216 (https://phabricator.wikimedia.org/T171881) (owner: 10BBlack) [16:19:08] !log uploaded boost 1.55.0+dfsg-3+wmf2+icu57 to apt.wikimedia.org for jessie-wikimedia/component/icu57 (needed for HHVM build linked against ICU 57) [16:19:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:24:14] !log disabling puppet on all cp* (testing encoding patch) [16:24:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:24:38] (03CR) 10BBlack: [C: 032] Fully normalize upload paths [puppet] - 10https://gerrit.wikimedia.org/r/391216 (https://phabricator.wikimedia.org/T171881) (owner: 10BBlack) [16:28:32] (03PS1) 10Rush: openstack: move ferm rules out of site.pp [puppet] - 10https://gerrit.wikimedia.org/r/392062 (https://phabricator.wikimedia.org/T171494) [16:28:59] (03PS2) 10Rush: openstack: move ferm rules out of site.pp [puppet] - 10https://gerrit.wikimedia.org/r/392062 (https://phabricator.wikimedia.org/T171494) [16:29:33] (03CR) 10jerkins-bot: [V: 04-1] openstack: move ferm rules out of site.pp [puppet] - 10https://gerrit.wikimedia.org/r/392062 (https://phabricator.wikimedia.org/T171494) (owner: 10Rush) [16:30:46] PROBLEM - puppet last run on cp4026 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 9 seconds ago with 2 failures. 
Failed resources (up to 3 shown): Exec[retry-load-new-vcl-file],Exec[retry-load-new-vcl-file-frontend] [16:34:13] (03PS1) 10Rush: labstore: fix rsync rule for misc [puppet] - 10https://gerrit.wikimedia.org/r/392063 (https://phabricator.wikimedia.org/T165136) [16:34:24] (03PS7) 10ArielGlenn: move remaining hardcoded paths for base dir of misc dumps out to profile [puppet] - 10https://gerrit.wikimedia.org/r/381838 (https://phabricator.wikimedia.org/T175528) [16:35:29] (03CR) 10ArielGlenn: [C: 032] move remaining hardcoded paths for base dir of misc dumps out to profile [puppet] - 10https://gerrit.wikimedia.org/r/381838 (https://phabricator.wikimedia.org/T175528) (owner: 10ArielGlenn) [16:35:30] (03PS3) 10Rush: openstack: move ferm rules out of site.pp [puppet] - 10https://gerrit.wikimedia.org/r/392062 (https://phabricator.wikimedia.org/T171494) [16:35:50] (03PS2) 10Rush: labstore: fix rsync rule for misc [puppet] - 10https://gerrit.wikimedia.org/r/392063 (https://phabricator.wikimedia.org/T165136) [16:36:32] (03PS4) 10Rush: openstack: move ferm rules out of site.pp [puppet] - 10https://gerrit.wikimedia.org/r/392062 (https://phabricator.wikimedia.org/T171494) [16:39:26] (03PS1) 10BBlack: normalization followup fix for $cluster... [puppet] - 10https://gerrit.wikimedia.org/r/392065 [16:41:40] (03CR) 10BBlack: [C: 032] normalization followup fix for $cluster... 
[puppet] - 10https://gerrit.wikimedia.org/r/392065 (owner: 10BBlack) [16:43:57] (03PS5) 10Rush: openstack: move ferm rules out of site.pp [puppet] - 10https://gerrit.wikimedia.org/r/392062 (https://phabricator.wikimedia.org/T171494) [16:45:00] (03PS6) 10Rush: openstack: move ferm rules out of site.pp [puppet] - 10https://gerrit.wikimedia.org/r/392062 (https://phabricator.wikimedia.org/T171494) [16:45:45] RECOVERY - puppet last run on cp4026 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [16:47:07] 10Operations, 10Ops-Access-Requests, 10Discovery, 10Wikimedia-Portals, and 2 others: Requesting deployment access for jdrewniak - https://phabricator.wikimedia.org/T180639#3770220 (10debt) [16:50:00] (03CR) 10Rush: [C: 032] openstack: move ferm rules out of site.pp [puppet] - 10https://gerrit.wikimedia.org/r/392062 (https://phabricator.wikimedia.org/T171494) (owner: 10Rush) [16:52:12] !log enabling new normalization code for all upload@ulsfo [16:52:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:52:59] 10Operations, 10Cloud-VPS, 10Toolforge, 10cloud-services-team (Kanban): Toolforge's static webserver broken by Puppet changes and stale nginx packages - https://phabricator.wikimedia.org/T175885#3606725 (10aborrero) A side note. When deploying T177920 into the tools cluster, we found some issue with `unat... 
[17:00:00] (03PS1) 10Marostegui: mariadb: Add db1109 and db1110 to s5/s8 [puppet] - 10https://gerrit.wikimedia.org/r/392068 (https://phabricator.wikimedia.org/T180700) [17:03:13] (03PS1) 10Rush: openstack: update notes and descriptions in site.pp [puppet] - 10https://gerrit.wikimedia.org/r/392069 [17:04:53] (03CR) 10Marostegui: [C: 032] mariadb: Add db1109 and db1110 to s5/s8 [puppet] - 10https://gerrit.wikimedia.org/r/392068 (https://phabricator.wikimedia.org/T180700) (owner: 10Marostegui) [17:08:02] (03CR) 10Madhuvishy: [C: 032] openstack: update notes and descriptions in site.pp [puppet] - 10https://gerrit.wikimedia.org/r/392069 (owner: 10Rush) [17:10:11] (03PS1) 10ArielGlenn: Prep for moving misc dump cron jobs to dumpsdata host [puppet] - 10https://gerrit.wikimedia.org/r/392070 (https://phabricator.wikimedia.org/T179942) [17:12:37] !log enabling new normalization code for all upload@codfw [17:12:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:12:51] !log Move dbstore1001 under db1070 - T180714 [17:12:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:12:57] T180714: s5 primary master db1063 crashed - https://phabricator.wikimedia.org/T180714 [17:14:04] RECOVERY - MariaDB Slave IO: s5 on dbstore1001 is OK: OK slave_io_state Slave_IO_Running: Yes [17:14:51] !log Revert schema change on dbstore1001 - T180714 [17:14:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:19:32] !log enabling new normalization code for all upload@eqiad [17:19:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:20:54] 10Operations, 10monitoring: Switch smokeping back to eqiad - https://phabricator.wikimedia.org/T180812#3770423 (10ayounsi) [17:21:18] !log enabling new normalization code for all upload@esams (done) [17:21:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:22:29] 10Operations, 10DBA, 10Patch-For-Review, 
10Wikimedia-Incident: s5 primary master db1063 crashed - https://phabricator.wikimedia.org/T180714#3770456 (10Marostegui) [17:23:36] 10Operations, 10Cloud-Services: username in use when creating account at wikitech - https://phabricator.wikimedia.org/T180813#3770464 (10Flominator) [17:27:37] (03CR) 10Ayounsi: "The pmacct part looks good to me, but I don't know enough about Kafka to +1 it." [puppet] - 10https://gerrit.wikimedia.org/r/392007 (https://phabricator.wikimedia.org/T173489) (owner: 10Elukey) [17:31:34] PROBLEM - Nginx local proxy to apache on mw2210 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:31:35] (03PS2) 10Rush: openstack: update notes and descriptions in site.pp [puppet] - 10https://gerrit.wikimedia.org/r/392069 [17:32:24] RECOVERY - Nginx local proxy to apache on mw2210 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.196 second response time [17:36:04] (03PS2) 10ArielGlenn: Prep for moving misc dump cron jobs to dumpsdata host [puppet] - 10https://gerrit.wikimedia.org/r/392070 (https://phabricator.wikimedia.org/T179942) [17:36:33] 10Operations, 10Analytics, 10DBA, 10Patch-For-Review, 10User-Elukey: Prep to decommission old dbstore hosts (db1046, db1047) - https://phabricator.wikimedia.org/T156844#3107358 (10mforns) Re @elukey Seems to me that the DROP DATABASE list is correct. To the list of databases to review I would add: akh... 
[17:37:13] (03CR) 10ArielGlenn: [C: 032] Prep for moving misc dump cron jobs to dumpsdata host [puppet] - 10https://gerrit.wikimedia.org/r/392070 (https://phabricator.wikimedia.org/T179942) (owner: 10ArielGlenn) [17:40:52] 10Operations, 10DBA, 10Patch-For-Review, 10Wikimedia-Incident: s5 primary master db1063 crashed - https://phabricator.wikimedia.org/T180714#3770570 (10Marostegui) [17:41:48] !log demon@tin Synchronized php-1.31.0-wmf.8/extensions/Wikibase/client/WikibaseClient.php: fix client dependencies (duration: 00m 50s) [17:41:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:43:31] !log demon@tin Synchronized php-1.31.0-wmf.8/extensions/Wikidata/extensions/Wikibase/client/WikibaseClient.php: fix client dependencies (duration: 00m 49s) [17:43:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:51:27] (03CR) 10Muehlenhoff: [C: 032] Remove enable-weak-ssl-ciphers again, we temporarily enabled it to allow 3des (and that is now disabled) (Bug: T180792) [debs/openssl11] - 10https://gerrit.wikimedia.org/r/392056 (https://phabricator.wikimedia.org/T180792) (owner: 10Muehlenhoff) [17:52:42] 10Operations, 10Traffic: Remove 3DES patch from OpenSSL builds - https://phabricator.wikimedia.org/T180792#3770645 (10MoritzMuehlenhoff) I've commited this to git, it doesn't warrant to roll new packages for this change alone, this can be piggybacked with the next openssl security update. [17:55:59] 10Operations, 10DC-Ops: Information missing from racktables - https://phabricator.wikimedia.org/T150651#3770659 (10RobH) >>! In T150651#3769660, @faidon wrote: > Thanks @Cmjohnson, that was fast! So remaining are: > - WMF7011 - WMF7015 elastic10NN (@Cmjohnson?) updated phab (had wrong sub task listed) and adde... 
[17:58:40] 10Operations, 10ops-eqiad: fix hostname for fmsw-eqiad to fmsw-c1-eqiad - https://phabricator.wikimedia.org/T180821#3770661 (10RobH) [18:04:03] (03Draft1) 10Paladox: Gerrit: Fix up logstash configuation [puppet] - 10https://gerrit.wikimedia.org/r/392079 (https://phabricator.wikimedia.org/T141324) [18:04:07] (03PS2) 10Paladox: Gerrit: Fix up logstash configuation [puppet] - 10https://gerrit.wikimedia.org/r/392079 (https://phabricator.wikimedia.org/T141324) [18:05:05] PROBLEM - Router interfaces on cr1-eqdfw is CRITICAL: CRITICAL: No response from remote host 208.80.153.198 for 1.3.6.1.2.1.2.2.1.7 with snmp version 2 [18:06:04] RECOVERY - Router interfaces on cr1-eqdfw is OK: OK: host 208.80.153.198, interfaces up: 35, down: 0, dormant: 0, excluded: 0, unused: 0 [18:08:54] 10Operations, 10Analytics, 10DBA, 10Patch-For-Review, 10User-Elukey: Prep to decommission old dbstore hosts (db1046, db1047) - https://phabricator.wikimedia.org/T156844#3770727 (10DarTar) I just spoke to @JAllemandou, we can help review these legacy tables early next week. 
[18:19:19] 10Operations, 10ops-ulsfo, 10Traffic: decom cp40(09|1[078]) - https://phabricator.wikimedia.org/T178815#3770808 (10RobH) 05Open>03stalled [18:19:21] 10Operations, 10ops-ulsfo, 10Traffic: replace ulsfo aging servers - https://phabricator.wikimedia.org/T164327#3770809 (10RobH) [18:23:47] (03PS3) 10Paladox: Gerrit: Fix up logstash configuation [puppet] - 10https://gerrit.wikimedia.org/r/392079 (https://phabricator.wikimedia.org/T141324) [18:24:06] (03PS1) 10Anomie: Set wgCommentTableSchemaMigrationStage = MIGRATION_WRITE_BOTH on Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/392082 (https://phabricator.wikimedia.org/T166733) [18:30:11] (03PS15) 10TerraCodes: Remove overlapping userrights [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370791 (https://phabricator.wikimedia.org/T101983) [18:31:21] (03Draft1) 10Paladox: Gerrit: Enable logstash for gerrit [puppet] - 10https://gerrit.wikimedia.org/r/392083 (https://phabricator.wikimedia.org/T141324) [18:31:25] (03PS2) 10Paladox: Gerrit: Enable logstash for gerrit [puppet] - 10https://gerrit.wikimedia.org/r/392083 (https://phabricator.wikimedia.org/T141324) [18:31:59] (03PS2) 10Anomie: Set wgCommentTableSchemaMigrationStage = MIGRATION_WRITE_BOTH on Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/392082 (https://phabricator.wikimedia.org/T166733) [18:32:18] (03PS3) 10Paladox: Gerrit: Enable logstash for gerrit [puppet] - 10https://gerrit.wikimedia.org/r/392083 (https://phabricator.wikimedia.org/T141324) [18:32:31] (03CR) 10Legoktm: [C: 031] Set wgCommentTableSchemaMigrationStage = MIGRATION_WRITE_BOTH on Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/392082 (https://phabricator.wikimedia.org/T166733) (owner: 10Anomie) [18:32:45] (03PS1) 10Dzahn: smokeping: switch data rsync direction [puppet] - 10https://gerrit.wikimedia.org/r/392084 (https://phabricator.wikimedia.org/T180812) [18:33:19] (03CR) 10Anomie: [C: 032] "For deployment to Beta Cluster" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/392082 (https://phabricator.wikimedia.org/T166733) (owner: 10Anomie) [18:35:39] (03Merged) 10jenkins-bot: Set wgCommentTableSchemaMigrationStage = MIGRATION_WRITE_BOTH on Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/392082 (https://phabricator.wikimedia.org/T166733) (owner: 10Anomie) [18:36:44] !log anomie@tin Synchronized wmf-config/InitialiseSettings-labs.php: Setting wgCommentTableSchemaMigrationStage = MIGRATION_WRITE_BOTH on Beta Cluster, no prod change (duration: 00m 49s) [18:36:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:37:33] !log anomie@tin Synchronized wmf-config/InitialiseSettings.php: Setting wgCommentTableSchemaMigrationStage = MIGRATION_WRITE_BOTH on Beta Cluster, no prod change (duration: 00m 48s) [18:37:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:39:10] (03PS1) 10ArielGlenn: set up directory for misc cron dumps on dumpsdata, move first cron job [puppet] - 10https://gerrit.wikimedia.org/r/392085 (https://phabricator.wikimedia.org/T179942) [18:42:14] (03PS4) 10Paladox: Gerrit: Fix up logstash configuation [puppet] - 10https://gerrit.wikimedia.org/r/392079 (https://phabricator.wikimedia.org/T141324) [18:48:05] (03PS2) 10ArielGlenn: set up directory for misc cron dumps on dumpsdata, move first cron job [puppet] - 10https://gerrit.wikimedia.org/r/392085 (https://phabricator.wikimedia.org/T179942) [18:50:27] (03PS1) 10Dzahn: smokeping: switch backend to eqiad [puppet] - 10https://gerrit.wikimedia.org/r/392086 (https://phabricator.wikimedia.org/T180812) [18:52:04] PROBLEM - BGP status on cr1-eqdfw is CRITICAL: BGP CRITICAL - No response from remote host 208.80.153.198 [18:52:54] RECOVERY - BGP status on cr1-eqdfw is OK: BGP OK - up: 53, down: 2, shutdown: 2 [18:53:13] (03CR) 10jenkins-bot: Set wgCommentTableSchemaMigrationStage = MIGRATION_WRITE_BOTH on Beta Cluster [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/392082 (https://phabricator.wikimedia.org/T166733) (owner: 10Anomie) [18:53:56] (03PS16) 10TerraCodes: Remove overlapping userrights [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370791 (https://phabricator.wikimedia.org/T101983) [18:54:28] (03PS3) 10ArielGlenn: set up directory for misc cron dumps on dumpsdata, move first cron job [puppet] - 10https://gerrit.wikimedia.org/r/392085 (https://phabricator.wikimedia.org/T179942) [18:58:39] (03PS1) 10Rush: openstack: labsaliases extra records move to module/profile/role [puppet] - 10https://gerrit.wikimedia.org/r/392087 (https://phabricator.wikimedia.org/T171494) [19:04:18] (03PS5) 10Paladox: Gerrit: Fix up logstash configuation [puppet] - 10https://gerrit.wikimedia.org/r/392079 (https://phabricator.wikimedia.org/T141324) [19:06:56] (03CR) 10Paladox: Gerrit: Fix up logstash configuation (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/392079 (https://phabricator.wikimedia.org/T141324) (owner: 10Paladox) [19:08:51] 10Operations, 10monitoring, 10Patch-For-Review: Switch smokeping back to eqiad - https://phabricator.wikimedia.org/T180812#3770974 (10Dzahn) a:03Dzahn [19:10:37] (03PS6) 10Paladox: Gerrit: Fix up logstash configuation [puppet] - 10https://gerrit.wikimedia.org/r/392079 (https://phabricator.wikimedia.org/T141324) [19:12:59] (03PS1) 10Dzahn: smokeping: enable cron to auto-rsync data [puppet] - 10https://gerrit.wikimedia.org/r/392090 (https://phabricator.wikimedia.org/T180812) [19:13:01] (03PS7) 10Paladox: Gerrit: Fix up logstash configuation [puppet] - 10https://gerrit.wikimedia.org/r/392079 (https://phabricator.wikimedia.org/T141324) [19:16:00] (03PS8) 10Paladox: Gerrit: Fix up logstash configuation [puppet] - 10https://gerrit.wikimedia.org/r/392079 (https://phabricator.wikimedia.org/T141324) [19:19:04] (03CR) 10Rush: "http://puppet-compiler.wmflabs.org/8845/" [puppet] - 10https://gerrit.wikimedia.org/r/392087 (https://phabricator.wikimedia.org/T171494) (owner: 
10Rush) [19:20:58] (03CR) 10Rush: [C: 032] openstack: labsaliases extra records move to module/profile/role [puppet] - 10https://gerrit.wikimedia.org/r/392087 (https://phabricator.wikimedia.org/T171494) (owner: 10Rush) [19:27:19] (03PS1) 10Rush: openstack: cleanup hiera tree for cloud/labs things [puppet] - 10https://gerrit.wikimedia.org/r/392091 (https://phabricator.wikimedia.org/T171494) [19:27:34] PROBLEM - BGP status on cr1-eqdfw is CRITICAL: BGP CRITICAL - No response from remote host 208.80.153.198 [19:29:25] RECOVERY - BGP status on cr1-eqdfw is OK: BGP OK - up: 53, down: 2, shutdown: 2 [19:34:39] (03CR) 10ArielGlenn: [C: 032] set up directory for misc cron dumps on dumpsdata, move first cron job [puppet] - 10https://gerrit.wikimedia.org/r/392085 (https://phabricator.wikimedia.org/T179942) (owner: 10ArielGlenn) [19:34:46] (03PS4) 10ArielGlenn: set up directory for misc cron dumps on dumpsdata, move first cron job [puppet] - 10https://gerrit.wikimedia.org/r/392085 (https://phabricator.wikimedia.org/T179942) [19:36:30] !log @netmon1002:/var/lib/smokeping# rsync -avp /var/lib/smokeping/ /root/backup/netmon1002/201711717/var/lib/smokeping/ [19:36:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:37:57] !log T180812 - @netmon1002:# rsync -avp /var/lib/smokeping/ /root/backup/netmon1002/201711717/var/lib/smokeping/@netmon2001:/# rsync -avp /var/lib/smokeping/ /root/backup/netmon2001/201711717/var/lib/smokeping/ [19:38:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:38:04] T180812: Switch smokeping back to eqiad - https://phabricator.wikimedia.org/T180812 [19:41:04] PROBLEM - cassandra-b SSL 10.64.48.136:7001 on restbase1014 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [19:41:05] PROBLEM - cassandra-b CQL 10.64.48.136:9042 on restbase1014 is CRITICAL: connect to address 10.64.48.136 and port 9042: Connection refused [19:45:52] ^^^ whoops, that should have been 
under maintenance [19:47:52] ACKNOWLEDGEMENT - cassandra-b CQL 10.64.48.136:9042 on restbase1014 is CRITICAL: connect to address 10.64.48.136 and port 9042: Connection refused eevans Decommissioned. [19:47:52] ACKNOWLEDGEMENT - cassandra-b SSL 10.64.48.136:7001 on restbase1014 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused eevans Decommissioned. [19:49:27] (03PS2) 10Dzahn: smokeping: switch data rsync direction [puppet] - 10https://gerrit.wikimedia.org/r/392084 (https://phabricator.wikimedia.org/T180812) [19:50:07] (03PS3) 10Dzahn: smokeping: switch data rsync direction [puppet] - 10https://gerrit.wikimedia.org/r/392084 (https://phabricator.wikimedia.org/T180812) [19:51:13] (03PS4) 10Dzahn: smokeping: switch data rsync direction [puppet] - 10https://gerrit.wikimedia.org/r/392084 (https://phabricator.wikimedia.org/T180812) [19:52:56] (03PS2) 10Rush: openstack: cleanup hiera tree for cloud/labs things [puppet] - 10https://gerrit.wikimedia.org/r/392091 (https://phabricator.wikimedia.org/T171494) [19:53:12] (03PS3) 10Rush: openstack: cleanup hiera tree for cloud/labs things [puppet] - 10https://gerrit.wikimedia.org/r/392091 (https://phabricator.wikimedia.org/T171494) [19:53:45] (03CR) 10Dzahn: [C: 032] "it's still a human running the sync command until we set auto_sync to true, this just does ferm/rsyncd conf to enable running it" [puppet] - 10https://gerrit.wikimedia.org/r/392084 (https://phabricator.wikimedia.org/T180812) (owner: 10Dzahn) [20:02:18] !log T180812 copying smokeping data from 2001 to 1002 - netmon1002: /usr/bin/rsync -avp rsync://netmon2001.wikimedia.org/var-lib-smokeping /var/lib/smokeping/ | switching backend from codfw to eqiad [20:02:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:02:26] T180812: Switch smokeping back to eqiad - https://phabricator.wikimedia.org/T180812 [20:02:27] (03PS2) 10Dzahn: smokeping: switch backend to eqiad [puppet] - 
10https://gerrit.wikimedia.org/r/392086 (https://phabricator.wikimedia.org/T180812) [20:03:30] (03CR) 10Dzahn: [C: 032] smokeping: switch backend to eqiad [puppet] - 10https://gerrit.wikimedia.org/r/392086 (https://phabricator.wikimedia.org/T180812) (owner: 10Dzahn) [20:03:44] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 23 probes of 286 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [20:05:34] !log running puppet on all cache misc to switch smokeping web to eqiad [20:05:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:12:09] (03PS1) 10Chad: Move role::gerrit::server to just role::gerrit [puppet] - 10https://gerrit.wikimedia.org/r/392095 [20:12:43] (03CR) 10Chad: Move role::gerrit::server to just role::gerrit (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/392095 (owner: 10Chad) [20:12:59] (03CR) 10Krinkle: [C: 031] webperf: Add missing mediaWikiLoad to navtiming2 [puppet] - 10https://gerrit.wikimedia.org/r/392036 (https://phabricator.wikimedia.org/T180598) (owner: 10Phedenskog) [20:14:14] PROBLEM - Juniper alarms on cr1-eqdfw is CRITICAL: JNX_ALARMS CRITICAL - No response from remote host 208.80.153.198 [20:15:28] (03PS2) 10Dzahn: smokeping: enable cron to auto-rsync data [puppet] - 10https://gerrit.wikimedia.org/r/392090 (https://phabricator.wikimedia.org/T180812) [20:15:55] RECOVERY - Juniper alarms on cr1-eqdfw is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms [20:21:57] (03PS1) 10Dzahn: Revert "smokeping: switch data rsync direction" [puppet] - 10https://gerrit.wikimedia.org/r/392098 [20:22:06] (03PS1) 10ArielGlenn: cron jobs on the dumpsdatahosts use the new dumpsgen user [puppet] - 10https://gerrit.wikimedia.org/r/392099 (https://phabricator.wikimedia.org/T179942) [20:22:56] (03CR) 10Dzahn: "This revert is as planned, sync it once, switch backend, (make it possible to) start syncing in other direction. 
As before this" [puppet] - https://gerrit.wikimedia.org/r/392098 (owner: Dzahn)
[20:23:44] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 19 probes of 286 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[20:24:28] (CR) Ayounsi: [C: 1] Revert "smokeping: switch data rsync direction" [puppet] - https://gerrit.wikimedia.org/r/392098 (owner: Dzahn)
[20:24:42] (CR) Ayounsi: [C: 1] smokeping: enable cron to auto-rsync data [puppet] - https://gerrit.wikimedia.org/r/392090 (https://phabricator.wikimedia.org/T180812) (owner: Dzahn)
[20:25:45] (CR) ArielGlenn: [C: 2] cron jobs on the dumpsdatahosts use the new dumpsgen user [puppet] - https://gerrit.wikimedia.org/r/392099 (https://phabricator.wikimedia.org/T179942) (owner: ArielGlenn)
[20:26:36] Operations, monitoring, Patch-For-Review: Switch smokeping back to eqiad - https://phabricator.wikimedia.org/T180812#3771200 (Dzahn) Ok, so far i did: - make a local rsync backup of /var/lib/smokeping into /root/backup/.. on both 1002 and 2001 in case i mess something up - switch the rsync directio...
[20:31:55] (CR) Dzahn: Move role::gerrit::server to just role::gerrit (1 comment) [puppet] - https://gerrit.wikimedia.org/r/392095 (owner: Chad)
[20:35:21] !log netmon1002 - rsync smokeping data back from local backup to show measurements made from eqiad as requested on T180812
[20:35:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:35:28] T180812: Switch smokeping back to eqiad - https://phabricator.wikimedia.org/T180812
[20:35:44] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 22 probes of 286 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[20:36:59] Operations, monitoring, Patch-For-Review: Switch smokeping back to eqiad - https://phabricator.wikimedia.org/T180812#3771216 (ayounsi) Thanks !
> Do we want to declare the current web backend the "active server" and disable the smokeping service on that one and constantly rsync data from there to th...
[20:37:07] Operations, monitoring, Patch-For-Review: Switch smokeping back to eqiad - https://phabricator.wikimedia.org/T180812#3771217 (Dzahn) >>! In T180812#3771201, @Stashbot wrote: > {nav icon=file, name=Mentioned in SAL (#wikimedia-operations), href=https://tools.wmflabs.org/sal/log/AV_LsWTPwg13V6286EYA} [...
[20:40:44] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 18 probes of 286 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[20:47:45] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 25 probes of 286 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[20:48:08] (CR) Ayounsi: "Comments addressed." (6 comments) [puppet] - https://gerrit.wikimedia.org/r/391149 (owner: Ayounsi)
[20:48:26] (PS6) Ayounsi: Have every rdns advertise a private anycast VIP [puppet] - https://gerrit.wikimedia.org/r/391149
[20:51:23] (PS4) Rush: openstack: cleanup hiera tree for cloud/labs things [puppet] - https://gerrit.wikimedia.org/r/392091 (https://phabricator.wikimedia.org/T171494)
[20:52:45] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 19 probes of 286 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[20:53:38] (PS5) Rush: openstack: cleanup hiera tree for cloud/labs things [puppet] - https://gerrit.wikimedia.org/r/392091 (https://phabricator.wikimedia.org/T171494)
[20:59:54] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 21 probes of 286 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[21:04:54] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 17 probes of 286 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[21:14:54] (PS1) Ayounsi:
Revert "DNS: Only send eqsin countries to ulsfo" [dns] - https://gerrit.wikimedia.org/r/392105
[21:15:01] (PS2) Ayounsi: Revert "DNS: Only send eqsin countries to ulsfo" [dns] - https://gerrit.wikimedia.org/r/392105
[21:15:52] (CR) Ayounsi: [C: 2] Revert "DNS: Only send eqsin countries to ulsfo" [dns] - https://gerrit.wikimedia.org/r/392105 (owner: Ayounsi)
[21:16:54] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 20 probes of 286 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[21:20:59] (CR) BBlack: [C: 1] Revert "DNS: Only send eqsin countries to ulsfo" [dns] - https://gerrit.wikimedia.org/r/392105 (owner: Ayounsi)
[21:21:54] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 18 probes of 286 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[21:24:09] (CR) Krinkle: [C: -1] "Do not merge as this would likely cause immediate data corruption as it isn't meant to run in more than one place and we don't yet have s" [puppet] - https://gerrit.wikimedia.org/r/392030 (https://phabricator.wikimedia.org/T179036) (owner: Dzahn)
[21:30:51] (CR) Krinkle: Fully normalize upload paths (1 comment) [puppet] - https://gerrit.wikimedia.org/r/391216 (https://phabricator.wikimedia.org/T171881) (owner: BBlack)
[21:36:29] (CR) Chad: "Scap *builds* against the python3 version ok, but we lack any testing on that rare code path to actually do much battle testing for you."
[software/conftool] - https://gerrit.wikimedia.org/r/387544 (owner: Volans)
[21:37:10] (CR) Chad: "Also, Scap's a little far off from having Py3 support be totally ready, so kudos for maintaining 2/3 compat :D" [software/conftool] - https://gerrit.wikimedia.org/r/387544 (owner: Volans)
[21:41:02] (CR) BBlack: Fully normalize upload paths (1 comment) [puppet] - https://gerrit.wikimedia.org/r/391216 (https://phabricator.wikimedia.org/T171881) (owner: BBlack)
[21:47:17] (CR) BBlack: "trivial nitpicks on non-code" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/391149 (owner: Ayounsi)
[21:56:55] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 21 probes of 286 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[22:02:56] (CR) Dzahn: [C: 1] "lgtm, just coordinate with cloud instances as they will break when role name changes" [puppet] - https://gerrit.wikimedia.org/r/392095 (owner: Chad)
[22:03:25] PROBLEM - BGP status on cr1-eqdfw is CRITICAL: BGP CRITICAL - ASunknown/IPv6: Active, ASunknown/IPv4: Active
[22:04:04] RECOVERY - BGP status on cr1-eqdfw is OK: BGP OK - up: 53, down: 2, shutdown: 2
[22:06:55] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 17 probes of 286 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[22:11:04] (PS6) Rush: openstack: cleanup hiera tree for cloud/labs things [puppet] - https://gerrit.wikimedia.org/r/392091 (https://phabricator.wikimedia.org/T171494)
[22:11:24] PROBLEM - BGP status on cr1-eqdfw is CRITICAL: BGP CRITICAL - No response from remote host 208.80.153.198
[22:11:49] (PS8) Rush: openstack: cleanup hiera tree for cloud/labs things [puppet] - https://gerrit.wikimedia.org/r/392091 (https://phabricator.wikimedia.org/T171494)
[22:12:02] (PS1) Kaldari: Allow admins to remove users from MP3 uploaders user group [mediawiki-config] - https://gerrit.wikimedia.org/r/392166
(https://phabricator.wikimedia.org/T180002)
[22:13:14] RECOVERY - BGP status on cr1-eqdfw is OK: BGP OK - up: 55, down: 0, shutdown: 2
[22:16:29] Operations, Developer-Relations: Bring discourse.mediawiki.org to production - https://phabricator.wikimedia.org/T180853#3771481 (Qgil)
[22:24:43] Operations, Developer-Relations, cloud-services-team: Create discourse-mediawiki.wmflabs.org - https://phabricator.wikimedia.org/T180854#3771512 (Qgil)
[22:27:05] Operations, Developer-Relations, cloud-services-team: Create discourse-mediawiki.wmflabs.org - https://phabricator.wikimedia.org/T180854#3771525 (Qgil) @chasemp @bd808 @Tgr your thoughts about things to consider before starting is very welcome (especially looking at the future migration to productio...
[22:29:04] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 21 probes of 286 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[22:34:04] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 19 probes of 286 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[22:34:57] (PS1) Dzahn: smokeping: add $active_server parameter and use it [puppet] - https://gerrit.wikimedia.org/r/392167 (https://phabricator.wikimedia.org/T180812)
[22:35:29] (CR) jerkins-bot: [V: -1] smokeping: add $active_server parameter and use it [puppet] - https://gerrit.wikimedia.org/r/392167 (https://phabricator.wikimedia.org/T180812) (owner: Dzahn)
[22:42:23] (PS2) Dzahn: smokeping: add $active_server parameter and use it [puppet] - https://gerrit.wikimedia.org/r/392167 (https://phabricator.wikimedia.org/T180812)
[22:44:13] (PS9) Rush: openstack: cleanup hiera tree for cloud/labs things [puppet] - https://gerrit.wikimedia.org/r/392091 (https://phabricator.wikimedia.org/T171494)
[22:44:30] (CR) Dzahn: "adding one style violation with a hiera call from role class but removing 2 by changing the includes, so it's a
positive ;)" [puppet] - https://gerrit.wikimedia.org/r/392167 (https://phabricator.wikimedia.org/T180812) (owner: Dzahn)
[22:49:47] (CR) Rush: "http://puppet-compiler.wmflabs.org/8852/" [puppet] - https://gerrit.wikimedia.org/r/392091 (https://phabricator.wikimedia.org/T171494) (owner: Rush)
[22:50:45] (PS1) Rush: openstack: openstack2 => openstack [puppet] - https://gerrit.wikimedia.org/r/392168 (https://phabricator.wikimedia.org/T171494)
[22:54:51] Operations, Developer-Relations, cloud-services-team: Create discourse-mediawiki.wmflabs.org - https://phabricator.wikimedia.org/T180854#3771558 (Tgr) SSO task is T124691, should probably be a blocker.
[23:03:08] (PS2) Dzahn: Revert "smokeping: switch data rsync direction" [puppet] - https://gerrit.wikimedia.org/r/392098
[23:03:29] (CR) Dzahn: [C: 2] Revert "smokeping: switch data rsync direction" [puppet] - https://gerrit.wikimedia.org/r/392098 (owner: Dzahn)
[23:05:31] (CR) Ayounsi: [C: 1] smokeping: add $active_server parameter and use it [puppet] - https://gerrit.wikimedia.org/r/392167 (https://phabricator.wikimedia.org/T180812) (owner: Dzahn)
[23:13:00] Operations, Developer-Relations, cloud-services-team (Kanban): Create discourse-mediawiki.wmflabs.org - https://phabricator.wikimedia.org/T180854#3771597 (bd808)
[23:19:55] (CR) Dzahn: [C: 2] "oops, did not mean to remove your review, thanks for adding it :)" [puppet] - https://gerrit.wikimedia.org/r/392090 (https://phabricator.wikimedia.org/T180812) (owner: Dzahn)
[23:32:28] (PS3) Dzahn: smokeping: enable cron to auto-rsync data [puppet] - https://gerrit.wikimedia.org/r/392090 (https://phabricator.wikimedia.org/T180812)
[23:32:58] (CR) Dzahn: [C: 2] smokeping: enable cron to auto-rsync data [puppet] - https://gerrit.wikimedia.org/r/392090 (https://phabricator.wikimedia.org/T180812) (owner: Dzahn)
[23:36:15] RECOVERY - cassandra-c CQL 10.192.48.51:9042 on restbase2006 is OK: TCP
OK - 0.036 second response time on 10.192.48.51 port 9042
[23:50:44] (PS1) Andrew Bogott: puppet: Move cloud VMs to the puppet 'future' environment [puppet] - https://gerrit.wikimedia.org/r/392172 (https://phabricator.wikimedia.org/T178508)
[23:51:32] (CR) Andrew Bogott: [C: -2] "I will merge this on Monday when I'm able to give it my full attention." [puppet] - https://gerrit.wikimedia.org/r/392172 (https://phabricator.wikimedia.org/T178508) (owner: Andrew Bogott)