[00:00:04] twentyafterfour: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Phabricator update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180712T0000).
[00:03:01] jouncebot: tell that to my ex.
[00:05:10] :(
[00:10:45] PROBLEM - rsyslog TLS listener on port 6514 on wezen is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer
[00:20:07] Operations, Core-Platform-Team, HHVM, TechCom-RFC (TechCom-Approved), User-ArielGlenn: Migrate to PHP 7 in WMF production - https://phabricator.wikimedia.org/T176370 (CCicalese_WMF)
[00:21:30] (PS1) 20after4: Disable phabricator rate limits [puppet] - https://gerrit.wikimedia.org/r/445328 (https://phabricator.wikimedia.org/T198974)
[00:22:54] Operations, Core-Platform-Team, WMF-JobQueue, monitoring, and 3 others: Collect error logs from jobchron/jobrunner services in Logstash - https://phabricator.wikimedia.org/T172479 (CCicalese_WMF)
[00:24:44] (PS1) 20after4: Add phabricator-antivandalism extension to the library path [puppet] - https://gerrit.wikimedia.org/r/445329 (https://phabricator.wikimedia.org/T162026)
[00:25:12] (CR) 20after4: [C: 1] Add phabricator-antivandalism extension to the library path [puppet] - https://gerrit.wikimedia.org/r/445329 (https://phabricator.wikimedia.org/T162026) (owner: 20after4)
[00:27:05] PROBLEM - rsyslog TLS listener on port 6514 on wezen is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer
[00:27:24] Operations, Core-Platform-Team, monitoring: Fix monitoring of poolcounter service - https://phabricator.wikimedia.org/T83729 (CCicalese_WMF)
[00:28:03] Operations, Core-Platform-Team, monitoring: High levels of PoolCounter errors should trigger alerts - https://phabricator.wikimedia.org/T133318 (CCicalese_WMF)
[00:28:39] (CR) 20after4: [C: 1] "https://puppet-compiler.wmflabs.org/compiler02/11781/phab1001.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/445329 (https://phabricator.wikimedia.org/T162026) (owner: 20after4)
[00:34:56] PROBLEM - rsyslog TLS listener on port 6514 on wezen is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer
[00:36:12] twentyafterfour https://phabricator.wikimedia.org/diffusion/external/?uri=ssh%3A%2F%2Fgit-ssh.wikimedia.org%2Fsource%2Fphabricator-ava.git&id=0f7585bc862556f67469a6f38fe2829508a1586e
[00:36:17] Call to a member function generateURI() on null
[00:38:18] Operations, Core-Platform-Team, Epic, Performance-Team (Radar), Services (watching): 2017/18 Annual Plan Program 8: Multi-datacenter support, Q2 goals - https://phabricator.wikimedia.org/T175213 (CCicalese_WMF)
[00:38:33] Operations, Core-Platform-Team, Epic, Performance-Team (Radar), Services (watching): 2017/18 Annual Plan Program 8: Multi-datacenter support - https://phabricator.wikimedia.org/T175206 (CCicalese_WMF)
[00:39:14] (CR) Paladox: [C: -1] "This will fail on phab.wmflabs.org." [puppet] - https://gerrit.wikimedia.org/r/445329 (https://phabricator.wikimedia.org/T162026) (owner: 20after4)
[00:39:22] (CR) Paladox: [C: -1] "as https://phabricator.wikimedia.org/source/phabricator-ava.git is private." [puppet] - https://gerrit.wikimedia.org/r/445329 (https://phabricator.wikimedia.org/T162026) (owner: 20after4)
[00:49:32] (CR) 20after4: [C: 1] "@paladox: then we should not load this one labs" [puppet] - https://gerrit.wikimedia.org/r/445329 (https://phabricator.wikimedia.org/T162026) (owner: 20after4)
[00:50:05] (CR) Paladox: [C: -1] "> @paladox: then we should not load this one labs" [puppet] - https://gerrit.wikimedia.org/r/445329 (https://phabricator.wikimedia.org/T162026) (owner: 20after4)
[00:51:15] PROBLEM - rsyslog TLS listener on port 6514 on wezen is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer
[00:51:36] (CR) 20after4: [C: 1] "we could create a dummy repo in the same location, just a blank libphutil library with nothing in it" [puppet] - https://gerrit.wikimedia.org/r/445329 (https://phabricator.wikimedia.org/T162026) (owner: 20after4)
[00:52:20] Operations, Core-Platform-Team, MediaWiki-Shell: Update limit.sh to support systemd-based cgroup management - https://phabricator.wikimedia.org/T136603 (CCicalese_WMF)
[01:30:55] PROBLEM - rsyslog TLS listener on port 6514 on wezen is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer
[01:36:26] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/summary/{title}{/revision}{/tid} (Get summary for test page) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of t
[01:36:26] for April 29, 2016) timed out before a response was received: /{domain}/v1/page/random/title (retrieve a random article) timed out before a response was received
[01:36:45] PROBLEM - Disk space on scb2003 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[01:36:55] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve all events for Jan 15) timed out before a response was received
[01:36:55] PROBLEM - eventstreams on scb2003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:37:06] PROBLEM - changeprop endpoints health on scb2003 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[01:37:06] PROBLEM - apertium apy on scb2003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:37:26] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy
[01:37:36] RECOVERY - Disk space on scb2003 is OK: DISK OK
[01:37:46] RECOVERY - eventstreams on scb2003 is OK: HTTP OK: HTTP/1.1 200 OK - 1066 bytes in 0.097 second response time
[01:37:55] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy
[01:38:06] RECOVERY - apertium apy on scb2003 is OK: HTTP OK: HTTP/1.1 200 OK - 5996 bytes in 0.074 second response time
[01:38:06] RECOVERY - changeprop endpoints health on scb2003 is OK: All endpoints are healthy
[02:02:15] PROBLEM - rsyslog TLS listener on port 6514 on wezen is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer
[02:35:18] !log l10nupdate@deploy1001 scap sync-l10n completed (1.32.0-wmf.10) (duration: 14m 44s)
[02:35:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:08:33] !log l10nupdate@deploy1001 scap sync-l10n completed (1.32.0-wmf.12) (duration: 14m 31s)
[03:08:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:18:59] !log l10nupdate@deploy1001 ResourceLoader cache refresh completed at Thu Jul 12 03:18:58 UTC 2018 (duration 10m 25s)
[03:19:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:46:46] (PS1) Krinkle: grafana: Remove 'featured' tag from varnish-http-errors dash [puppet] - https://gerrit.wikimedia.org/r/445336
[03:46:55] PROBLEM - rsyslog TLS listener on port 6514 on wezen is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer
[03:50:18] (CR) Krinkle: "@Vgutierrez: @Giuseppe: Are the prometheus-prefixed dashboards meant to be transitional, or permanent?" [puppet] - https://gerrit.wikimedia.org/r/445336 (owner: Krinkle)
[04:29:55] PROBLEM - rsyslog TLS listener on port 6514 on wezen is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer
[04:39:57] (PS1) Marostegui: db-eqiad.php: Depool db1123 [mediawiki-config] - https://gerrit.wikimedia.org/r/445337 (https://phabricator.wikimedia.org/T146591)
[04:41:38] (CR) jerkins-bot: [V: -1] db-eqiad.php: Depool db1123 [mediawiki-config] - https://gerrit.wikimedia.org/r/445337 (https://phabricator.wikimedia.org/T146591) (owner: Marostegui)
[04:42:43] (PS2) Marostegui: db-eqiad.php: Depool db1123 [mediawiki-config] - https://gerrit.wikimedia.org/r/445337 (https://phabricator.wikimedia.org/T146591)
[04:44:57] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1123 [mediawiki-config] - https://gerrit.wikimedia.org/r/445337 (https://phabricator.wikimedia.org/T146591) (owner: Marostegui)
[04:46:34] (Merged) jenkins-bot: db-eqiad.php: Depool db1123 [mediawiki-config] - https://gerrit.wikimedia.org/r/445337 (https://phabricator.wikimedia.org/T146591) (owner: Marostegui)
[04:47:38] (CR) jenkins-bot: db-eqiad.php: Depool db1123 [mediawiki-config] - https://gerrit.wikimedia.org/r/445337 (https://phabricator.wikimedia.org/T146591) (owner: Marostegui)
[04:47:57] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1123 for alter table (duration: 00m 58s)
[04:48:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:48:01] !log Deploy schema change on db1123 T146591 T197891 T196379
[04:48:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:48:06] T196379: Schema change: Add unique index on archive.ar_rev_id - https://phabricator.wikimedia.org/T196379
[04:48:07] T197891: Schema change to drop default from externallinks.el_index_60 - https://phabricator.wikimedia.org/T197891
[04:48:08] T146591: Add a primary key to l10n_cache - https://phabricator.wikimedia.org/T146591
[04:48:28] !log Optimize wbc_entity_usage on bewiki cewiki dawiki hywiki ttwiki on db1123 - T187521
[04:48:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:48:31] T187521: Optimize recentchanges and wbc_entity_usage table across wikis - https://phabricator.wikimedia.org/T187521
[04:53:45] (PS1) Tim Starling: Add myself to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/445340
[04:54:24] (CR) Tim Starling: [C: 2] Add myself to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/445340 (owner: Tim Starling)
[04:56:54] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1123" [mediawiki-config] - https://gerrit.wikimedia.org/r/445341
[04:58:39] (CR) Marostegui: [C: 2] Revert "db-eqiad.php: Depool db1123" [mediawiki-config] - https://gerrit.wikimedia.org/r/445341 (owner: Marostegui)
[05:00:04] !log Deploy schema change on db1075 (s3 primary master)/script unload irssinotifier T146591 T197891 T196379
[05:00:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:00:09] T196379: Schema change: Add unique index on archive.ar_rev_id - https://phabricator.wikimedia.org/T196379
[05:00:09] T197891: Schema change to drop default from externallinks.el_index_60 - https://phabricator.wikimedia.org/T197891
[05:00:09] T146591: Add a primary key to l10n_cache - https://phabricator.wikimedia.org/T146591
[05:00:12] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1123" [mediawiki-config] - https://gerrit.wikimedia.org/r/445341 (owner: Marostegui)
[05:00:24] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1123" [mediawiki-config] - https://gerrit.wikimedia.org/r/445341 (owner: Marostegui)
[05:01:23] !log Optimize wbc_entity_usage on bewiki cewiki dawiki hywiki ttwiki on db1075 (s3 primary master) - T187521
[05:01:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:01:26] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1123 after alter table (duration: 00m 57s)
[05:01:26] T187521: Optimize recentchanges and wbc_entity_usage table across wikis - https://phabricator.wikimedia.org/T187521
[05:01:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:14:51] Operations, Goal: Perform a datacenter switchover (2018-19 Q1) - https://phabricator.wikimedia.org/T199073 (Krinkle)
[05:21:15] PROBLEM - rsyslog TLS listener on port 6514 on wezen is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer
[06:15:01] fixing --^
[06:15:09] it seems a recurrent issue
[06:15:13] will open a task
[06:25:16] PROBLEM - rsyslog TLS listener on port 6514 on wezen is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer
[06:25:26] Operations: rsyslog's in:imtcp thread stuck on old sockets - https://phabricator.wikimedia.org/T199406 (elukey) p:Triage>High
[06:25:46] RECOVERY - rsyslog TLS listener on port 6514 on wezen is OK: SSL OK - Certificate wezen.codfw.wmnet valid until 2021-08-21 20:09:05 +0000 (expires in 1136 days)
[06:26:01] Operations: rsyslog's in:imtcp thread stuck on old sockets - https://phabricator.wikimedia.org/T199406 (elukey)
[06:26:08] !log restart rsyslog on wezen - T199406
[06:26:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:26:11] T199406: rsyslog's in:imtcp thread stuck on old sockets - https://phabricator.wikimedia.org/T199406
[06:28:54] Operations: rsyslog's in:imtcp thread stuck on old sockets - https://phabricator.wikimedia.org/T199406 (elukey)
[06:29:56] PROBLEM - puppet last run on labvirt1017 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 4 minutes ago with 2 failures. Failed resources (up to 3 shown): File[/var/lib/apt/keys/ubuntucloud.gpg],File[/usr/local/share/ca-certificates/RapidSSL_SHA256_CA_-_G3.crt]
[06:30:26] Operations: rsyslog's in:imtcp thread stuck on old sockets - https://phabricator.wikimedia.org/T199406 (elukey)
[06:38:56] !log Drop unused grants from db1061 db1096:3316 db1113:3316
[06:38:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:43:07] (PS2) Elukey: Correct white-list path for EventLogging sanitization in Hive [puppet] - https://gerrit.wikimedia.org/r/445187 (https://phabricator.wikimedia.org/T193176) (owner: Mforns)
[06:43:15] * elukey waves to marostegui
[06:43:23] Hey elukey o/
[06:44:01] (CR) Elukey: [C: 2] Correct white-list path for EventLogging sanitization in Hive [puppet] - https://gerrit.wikimedia.org/r/445187 (https://phabricator.wikimedia.org/T193176) (owner: Mforns)
[07:00:26] RECOVERY - puppet last run on labvirt1017 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures
[07:01:11] (PS3) Muehlenhoff: Add Michele Catasta to users [puppet] - https://gerrit.wikimedia.org/r/445153 (https://phabricator.wikimedia.org/T198662)
[07:01:46] !log Drop unused grants from dbstore2002:3320 dbstore1001:3311 db2037 db2034
[07:01:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:03:08] (CR) Muehlenhoff: [C: 2] Add Michele Catasta to users [puppet] - https://gerrit.wikimedia.org/r/445153 (https://phabricator.wikimedia.org/T198662) (owner: Muehlenhoff)
[07:08:03] (PS1) Marostegui: db1083.yaml: Change binlog format [puppet] - https://gerrit.wikimedia.org/r/445349 (https://phabricator.wikimedia.org/T197069)
[07:10:17] (CR) Marostegui: [C: 2] db1083.yaml: Change binlog format [puppet] - https://gerrit.wikimedia.org/r/445349 (https://phabricator.wikimedia.org/T197069) (owner: Marostegui)
[07:12:28] (PS1) Marostegui: db-eqiad.php: Depool db1083 [mediawiki-config] - https://gerrit.wikimedia.org/r/445350 (https://phabricator.wikimedia.org/T197069)
[07:14:32] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1083 [mediawiki-config] - https://gerrit.wikimedia.org/r/445350 (https://phabricator.wikimedia.org/T197069) (owner: Marostegui)
[07:16:16] (Merged) jenkins-bot: db-eqiad.php: Depool db1083 [mediawiki-config] - https://gerrit.wikimedia.org/r/445350 (https://phabricator.wikimedia.org/T197069) (owner: Marostegui)
[07:17:51] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1083 for maintenance (duration: 00m 58s)
[07:17:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:18:05] (PS1) Muehlenhoff: Add pirroh to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/445351 (https://phabricator.wikimedia.org/T198662)
[07:18:13] (CR) jenkins-bot: db-eqiad.php: Depool db1083 [mediawiki-config] - https://gerrit.wikimedia.org/r/445350 (https://phabricator.wikimedia.org/T197069) (owner: Marostegui)
[07:18:48] !log Stop MySQL on db1083 to pick up new binlog format and MySQL upgrade
[07:18:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:19:04] (CR) Muehlenhoff: [C: 2] Add pirroh to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/445351 (https://phabricator.wikimedia.org/T198662) (owner: Muehlenhoff)
[07:21:14] (PS8) Volans: wmf-auto-reimage: validate certificate fingerprint [puppet] - https://gerrit.wikimedia.org/r/433928
[07:23:30] (CR) Volans: [C: 2] wmf-auto-reimage: validate certificate fingerprint [puppet] - https://gerrit.wikimedia.org/r/433928 (owner: Volans)
[07:24:09] (CR) Volans: [C: 2] wmf-auto-reimage: improve donwtime of reimaged host [puppet] - https://gerrit.wikimedia.org/r/434894 (https://phabricator.wikimedia.org/T195423) (owner: Volans)
[07:24:18] (PS6) Volans: wmf-auto-reimage: improve donwtime of reimaged host [puppet] - https://gerrit.wikimedia.org/r/434894 (https://phabricator.wikimedia.org/T195423)
[07:25:06] (CR) Muehlenhoff: [C: 2] Add pirroh to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/445154 (https://phabricator.wikimedia.org/T198662) (owner: Muehlenhoff)
[07:25:35] moritzm: I'm merging the reimage chain of changes, FYI
[07:25:48] (PS4) Volans: wmf-auto-reimage: use absolute path for subprocess [puppet] - https://gerrit.wikimedia.org/r/434896
[07:25:59] (Abandoned) Muehlenhoff: Add pirroh to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/445154 (https://phabricator.wikimedia.org/T198662) (owner: Muehlenhoff)
[07:26:19] volans: ack
[07:27:26] (CR) Volans: [C: 2] wmf-auto-reimage: use absolute path for subprocess [puppet] - https://gerrit.wikimedia.org/r/434896 (owner: Volans)
[07:27:55] (PS3) Volans: wmf-auto-reimage: fix parse argument bug [puppet] - https://gerrit.wikimedia.org/r/443670
[07:29:23] (CR) Volans: [C: 2] wmf-auto-reimage: fix parse argument bug [puppet] - https://gerrit.wikimedia.org/r/443670 (owner: Volans)
[07:29:39] (PS3) Volans: wmf-auto-reimage: use warning log level [puppet] - https://gerrit.wikimedia.org/r/443671
[07:29:50] (PS1) Marostegui: db-eqiad.php: Repool db1083 with low weight [mediawiki-config] - https://gerrit.wikimedia.org/r/445352 (https://phabricator.wikimedia.org/T197069)
[07:30:27] (CR) Volans: [C: 2] wmf-auto-reimage: use warning log level [puppet] - https://gerrit.wikimedia.org/r/443671 (owner: Volans)
[07:31:19] (PS2) Marostegui: db-eqiad.php: Repool db1083 with low weight [mediawiki-config] - https://gerrit.wikimedia.org/r/445352 (https://phabricator.wikimedia.org/T197069)
[07:31:25] (CR) jerkins-bot: [V: -1] db-eqiad.php: Repool db1083 with low weight [mediawiki-config] - https://gerrit.wikimedia.org/r/445352 (https://phabricator.wikimedia.org/T197069) (owner: Marostegui)
[07:32:14] Operations, Research, SRE-Access-Requests, Patch-For-Review: Request access to data for citation usage research - https://phabricator.wikimedia.org/T198662 (MoritzMuehlenhoff) Open>Resolved @Pirroh: You should be able to login now, if you run into any issues best to ping the #wikimedia-op...
[07:34:06] (CR) Marostegui: [C: 2] db-eqiad.php: Repool db1083 with low weight [mediawiki-config] - https://gerrit.wikimedia.org/r/445352 (https://phabricator.wikimedia.org/T197069) (owner: Marostegui)
[07:35:49] (Merged) jenkins-bot: db-eqiad.php: Repool db1083 with low weight [mediawiki-config] - https://gerrit.wikimedia.org/r/445352 (https://phabricator.wikimedia.org/T197069) (owner: Marostegui)
[07:37:09] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1083 with low weight (duration: 00m 57s)
[07:37:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:37:28] !log reimaging spare system californium for testing reimage script changes
[07:37:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:37:48] (CR) jenkins-bot: db-eqiad.php: Repool db1083 with low weight [mediawiki-config] - https://gerrit.wikimedia.org/r/445352 (https://phabricator.wikimedia.org/T197069) (owner: Marostegui)
[07:41:23] (PS1) Marostegui: mariadb: Promote db1067 to s1 masters [puppet] - https://gerrit.wikimedia.org/r/445354 (https://phabricator.wikimedia.org/T197069)
[07:42:24] (PS2) Marostegui: mariadb: Promote db1067 to s1 master [puppet] - https://gerrit.wikimedia.org/r/445354 (https://phabricator.wikimedia.org/T197069)
[07:43:16] (CR) Marostegui: [C: -2] "Wait for the failover date" [puppet] - https://gerrit.wikimedia.org/r/445354 (https://phabricator.wikimedia.org/T197069) (owner: Marostegui)
[07:51:13] _joe_ elukey fyi I'm gonna start incident report for yesterday here https://wikitech.wikimedia.org/wiki/Incident_documentation/20180711-kafka-eqiad
[07:51:33] we can switch back to eqiad whenever you're ready
[07:52:48] Pchelolo: morning!!!
[07:52:56] I was chatting about it in the other chan
[07:53:05] still have some ? open
[07:53:44] Operations, ops-codfw: tegmen is down - https://phabricator.wikimedia.org/T199318 (MoritzMuehlenhoff) Open>Resolved Thanks, system looks stable hardware-wise now.
[07:53:47] <_joe_> yeah my proposal for recovery is
[07:54:04] (PS3) Marostegui: mariadb: Promote db1067 to s1 master [puppet] - https://gerrit.wikimedia.org/r/445354 (https://phabricator.wikimedia.org/T197069)
[07:54:09] <_joe_> 1 - start puppet on scb100*; this will start cpjobqueue and changeprop hopefully
[07:54:32] <_joe_> 2 - switch eventbus back to active/active
[07:54:43] <_joe_> so events from mediawiki will go back to eqiad
[07:55:01] <_joe_> 3 - change the config of eventstreams, run puppet again on scb*
[07:56:08] _joe_: I was thinking the same. the only thing is that 3 is disruptive for clients due to what Andrew said in that comment why eventstreams are pointing to eqiad only, but it doesn't seem avoidable
[07:56:23] <_joe_> Pchelolo: exactly, and eventstreams is at fault
[07:56:30] <_joe_> its architecture is clearly too naive
[07:56:31] (PS1) Ema: reload-vcl: manually set separate VCL files as warm [puppet] - https://gerrit.wikimedia.org/r/445357 (https://phabricator.wikimedia.org/T164609)
[07:57:03] <_joe_> elukey: any reason to delay all this?
[07:57:46] (CR) Marostegui: [C: -2] "https://puppet-compiler.wmflabs.org/compiler02/11783/" [puppet] - https://gerrit.wikimedia.org/r/445354 (https://phabricator.wikimedia.org/T197069) (owner: Marostegui)
[07:58:06] _joe_ not that I know of, everything looks good from my point of view. I'd like to set the Kafka Xmx/Xms settings to 2G but will do it later on, eqiad is already running with these
[07:58:22] Pchelolo's point about eventstreams is right though
[07:58:32] we might need to alert people in operations/analytics
[07:58:48] <_joe_> did we alert them yesterday?
[07:58:57] <_joe_> but yes, send an email to wikitech that we had to do that
[07:59:04] I only sent a message afterwards
[07:59:16] to operations@
[07:59:21] Operations, Icinga, monitoring: Max concurrent service checks reached on tegmen - https://phabricator.wikimedia.org/T199413 (MoritzMuehlenhoff)
[07:59:23] Operations, Icinga, monitoring: Max concurrent service checks reached on tegmen - https://phabricator.wikimedia.org/T199413 (MoritzMuehlenhoff) p:Triage>High
[07:59:25] <_joe_> yeah answer to that message saying we're witching back
[07:59:34] <_joe_> *switching back
[07:59:42] (PS1) Marostegui: db-eqiad.php: Increase weight for db1083 [mediawiki-config] - https://gerrit.wikimedia.org/r/445358
[08:00:30] I am checking the last offset reset email that Andrew sent
[08:00:43] it was broadcasted to several ml, will take those and send a message
[08:01:40] _joe_ do we want to keep LimitNOFILE=infinity ?
[08:01:45] <_joe_> elukey: yes
[08:01:53] <_joe_> that's what killed kafka initially, right?
[08:02:07] <_joe_> elukey: eventstreams needs to be re-thought, tbh [08:02:13] sure but I am not sure if was 65k the limit, it doesn't make a lot of sense [08:02:15] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Increase weight for db1083 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445358 (owner: 10Marostegui) [08:02:24] anyhow, we can keep that for now [08:03:06] _joe_ it is difficult to re-think it since sadly it delegates consumption of kafka topics (via eventstreams) to external clients, so out of our controll [08:03:17] but yes let's discuss it with andrew when he is back [08:03:56] (03Merged) 10jenkins-bot: db-eqiad.php: Increase weight for db1083 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445358 (owner: 10Marostegui) [08:04:27] (03CR) 10Vgutierrez: "> @Vgutierrez: @Giuseppe: Are the prometheus-prefixed dashboards" [puppet] - 10https://gerrit.wikimedia.org/r/445336 (owner: 10Krinkle) [08:05:13] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Increase weight for db1083 (duration: 00m 56s) [08:05:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:05:57] Pchelolo: when did we switch eventbus and eventstreams to codfw exactly? 
(if you already have the timeline) [08:06:19] elukey: just woke up, no timeline yet, lemme check, 1 min [08:06:36] <_joe_> elukey: we have the times [08:06:38] <_joe_> in the SAL [08:07:13] <_joe_> 17:00 oblivian@puppetmaster1001: conftool action : set/pooled=false; selector: dnsdisc=eventbus,name=eqiad [08:07:39] (03Abandoned) 10Filippo Giunchedi: phabricator: bump request rate_limits [puppet] - 10https://gerrit.wikimedia.org/r/445145 (owner: 10Filippo Giunchedi) [08:07:50] <_joe_> and [08:07:50] yep yep I asked because I felt a bit lazy, I would have checked in there :) [08:07:53] <_joe_> 18:44 akosiaris: ok, change merged, running puppet on scb hosts [08:07:55] (03CR) 10jenkins-bot: db-eqiad.php: Increase weight for db1083 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445358 (owner: 10Marostegui) [08:08:10] both UTC or Rome timezone? [08:08:19] <_joe_> UTC [08:08:22] super [08:08:23] <_joe_> it's the SAL [08:08:23] thanks :) [08:08:28] 18:44 [08:08:38] ah sorry for a moment I thought it was your chat, still a bit sleepy [08:08:51] Going to send the email then to announce the switch back [08:09:26] <_joe_> ok preparing the change [08:11:04] (03PS1) 10Giuseppe Lavagetto: Revert "Switch eventstreams to consuming only from kafka-codfw." [puppet] - 10https://gerrit.wikimedia.org/r/445360 [08:11:17] <_joe_> that took a *ton* of effort [08:11:39] <_joe_> ok tell me when we're ready, I'll reenable puppet on scb100* in the meantime [08:11:58] <_joe_> !log reeabling and running puppet on scb1* [08:11:59] damn I am still sleepy [08:12:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:12:14] I saw the 18:44 log entry by _joe_ from me yesterday [08:12:23] and I thought he was running puppet on scb hosts [08:12:57] <_joe_> I am right now :D [08:12:57] and my sentence is a bit hard to read from what I gather... 
bear with me today [08:13:09] (03PS1) 10Marostegui: db-eqiad.php: Increase weight for db1083 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445361 [08:13:21] (03PS1) 10Muehlenhoff: Rename wasat to mwmaint1001 for reimage [dns] - 10https://gerrit.wikimedia.org/r/445362 [08:13:26] akosiaris: the third cup of coffee helps [08:13:31] <_joe_> moritzm: 2001 :D [08:13:51] RECOVERY - Check systemd state on scb1004 is OK: OK - running: The system is fully operational [08:14:01] RECOVERY - Check systemd state on scb1003 is OK: OK - running: The system is fully operational [08:14:11] good point :-) [08:14:14] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "The code seems correct but the commit message is wrong (1001 vs 2001)" [dns] - 10https://gerrit.wikimedia.org/r/445362 (owner: 10Muehlenhoff) [08:14:29] (03PS2) 10Muehlenhoff: Rename wasat to mwmaint2001 for reimage [dns] - 10https://gerrit.wikimedia.org/r/445362 [08:14:31] RECOVERY - Check systemd state on scb1001 is OK: OK - running: The system is fully operational [08:14:31] RECOVERY - Check systemd state on scb1002 is OK: OK - running: The system is fully operational [08:14:41] RECOVERY - puppet last run on scb1004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [08:15:11] (03CR) 10Filippo Giunchedi: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/445336 (owner: 10Krinkle) [08:15:23] (03PS1) 10Marostegui: wmnet: Update s1-master alias [dns] - 10https://gerrit.wikimedia.org/r/445363 (https://phabricator.wikimedia.org/T197069) [08:15:32] RECOVERY - puppet last run on scb1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [08:17:12] RECOVERY - puppet last run on scb1001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [08:17:24] _joe_ spam completed [08:18:10] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Increase weight for db1083 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445361 (owner: 10Marostegui) [08:18:11] 
RECOVERY - puppet last run on scb1003 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [08:18:35] (03CR) 10Marostegui: [C: 04-2] "Wait for the failover date" [dns] - 10https://gerrit.wikimedia.org/r/445363 (https://phabricator.wikimedia.org/T197069) (owner: 10Marostegui) [08:18:50] Pchelolo: I failed to send the email to huggle@ (not sure what that is) and mediawiki-api-announce@ (both used by Andrew the last time that we did invasive maintenance to Eventstreams) [08:19:47] (03Merged) 10jenkins-bot: db-eqiad.php: Increase weight for db1083 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445361 (owner: 10Marostegui) [08:19:59] (03CR) 10jenkins-bot: db-eqiad.php: Increase weight for db1083 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445361 (owner: 10Marostegui) [08:20:05] <_joe_> ok let's go [08:20:57] !log oblivian@puppetmaster1001 conftool action : set/ttl=10; selector: dnsdisc=eventbus,name=.* [08:20:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:21:06] <_joe_> I reduced the TTL for eventbus [08:21:15] <_joe_> so the change can be more atomic [08:21:39] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Increase weight for db1083 (duration: 00m 56s) [08:21:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:21:48] (03PS2) 10Giuseppe Lavagetto: Revert "Switch eventstreams to consuming only from kafka-codfw." [puppet] - 10https://gerrit.wikimedia.org/r/445360 [08:22:09] we also will re-process some of the events that did manage to get in kafka eqiad while change-prop or job queue were down there [08:22:29] s/re-process/actually-process [08:24:39] (03CR) 10Giuseppe Lavagetto: [C: 032] Revert "Switch eventstreams to consuming only from kafka-codfw." 
[puppet] - 10https://gerrit.wikimedia.org/r/445360 (owner: 10Giuseppe Lavagetto) [08:24:49] kafka metrics already moving, all good from the logs [08:24:49] <_joe_> Pchelolo: yeah [08:25:04] <_joe_> elukey: that's because we readded cpjobqueue and changeprop [08:25:06] 10Operations, 10media-storage, 10User-fgiunchedi: Some swift filesystems reporting negative disk usage - https://phabricator.wikimedia.org/T199198 (10fgiunchedi) Yesterday the extended diagnostics on ms-be1040 took all day and revealed no problem {F23668273} [08:25:08] (i guess that changeprop is working) [08:25:09] yep yep [08:25:11] <_joe_> I still haven't migrated eventstreams [08:25:15] <_joe_> going to do it now [08:25:17] and eventbus [08:25:18] (03PS1) 10Marostegui: db-eqiad.php: Increase weight for db1083 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445364 [08:25:54] !log oblivian@puppetmaster1001 conftool action : set/pooled=true; selector: dnsdisc=eventbus,name=.* [08:25:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:26:04] <_joe_> eventbus migrated [08:26:18] <_joe_> and in fact I see less and less messages coming from eventstreams [08:26:41] <_joe_> merging the puppet change now [08:27:32] <_joe_> uhm, tbh, eventstreams sees messages coming from eqiad via mirror-maker [08:27:38] <_joe_> the only problem is the offsets [08:27:47] exactlu [08:27:50] *y [08:28:35] <_joe_> !log forcing puppet run on scb* to pick up the eventstreams change [08:28:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:29:21] RECOVERY - Host ms-be1040 is UP: PING OK - Packet loss = 0%, RTA = 0.17 ms [08:29:43] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Increase weight for db1083 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445364 (owner: 10Marostegui) [08:30:16] <_joe_> eventstreams works [08:30:43] <_joe_> I'm re-raising the TTL for eventbus [08:30:49] !log oblivian@puppetmaster1001 conftool action : set/ttl=300; selector: 
dnsdisc=eventbus,name=.* [08:30:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:30:53] all good from the logs [08:30:54] I am guessing some eventstreams clients will break [08:31:02] <_joe_> akosiaris: yes [08:31:04] but there's nothing we can do about it [08:31:10] <_joe_> akosiaris: yes [08:31:20] <_joe_> unless we change the way it's conceived [08:31:22] (03Merged) 10jenkins-bot: db-eqiad.php: Increase weight for db1083 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445364 (owner: 10Marostegui) [08:31:37] (03CR) 10jenkins-bot: db-eqiad.php: Increase weight for db1083 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445364 (owner: 10Marostegui) [08:32:46] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Increase weight for db1083 (duration: 00m 55s) [08:32:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:34:44] _joe_: as I understand the main problem with EventStreams is that they support reconnecting, but to reconnect they use the offset number when they disconnected, and mirror-maker doesn't support preserving offsets [08:35:26] <_joe_> Pchelolo: yeah the point is you're creating an API as a product, and as a user of that product I do not care if you provide me internal details of your implementation, as we do [08:35:52] <_joe_> so instead of the kafka offset, provide one global index and find yourself a way to synchronize that with both clusters [08:36:08] ye, sure, I'm not arguing it's correct, just providing some background that I finally remembered [08:36:26] <_joe_> I got what the issue is [08:36:37] <_joe_> I've been playing with eventstreams since yesterday evening :D [08:36:59] <_joe_> the function is cool and I can think of various ways to use it to do cool projects [08:39:35] (03PS1) 10Marostegui: db-eqiad.php: Fully repool db1083 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445365 [08:39:41] it is also possible to consume from different topics in the
same stream, very cool [08:41:46] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Fully repool db1083 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445365 (owner: 10Marostegui) [08:43:23] (03Merged) 10jenkins-bot: db-eqiad.php: Fully repool db1083 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445365 (owner: 10Marostegui) [08:43:35] (03CR) 10jenkins-bot: db-eqiad.php: Fully repool db1083 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445365 (owner: 10Marostegui) [08:43:51] I'll restart jobqueue in codfw, it logs some kafka disconnects [08:44:41] !log ppchelko@deploy1001 Started restart [cpjobqueue/deploy@ba672a3]: It logs disconnected consumers after many rebalances during the outage [08:44:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:44:47] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Fully repool db1083 (duration: 00m 56s) [08:44:48] kafka consumer lags do not show anything horrible [08:44:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:44:58] and mirror maker seems to be working fine [08:47:12] <_joe_> Pchelolo: that's a bit strange, shouldn't be happening [08:49:29] _joe_: that will be one of the action items for the incident report [08:50:13] what I'd like to do during the next hours is to have a skeleton of the main actions taken, and then check the kafka logs to see the chain of events that caused the mess [08:50:29] I don't have a clear picture yet of all the issues [08:51:13] elukey: bear with me, I'm writing it down at https://wikitech.wikimedia.org/wiki/Incident_documentation/20180711-kafka-eqiad.
I'll save from time to time [08:52:18] Pchelolo: I didn't mean to force you to do it, I was thinking out loud :) If you want we can review the timings together on hangouts [08:52:24] I only need more caffeine [08:52:35] Not at the timings yet [08:53:03] well even other things if it helps a brain bounce [08:53:12] otherwise I'll shut up :) [08:54:31] <_joe_> anyways, it was a tough ride, thanks for handling it elukey Pchelolo <3 [08:55:01] _joe_: given that I've caused it in the first place, thank you for bearing with us _joe_ [08:55:18] <_joe_> I'm sorry for not thinking of switching eventbus earlier, I realized I had a hole in my knowledge of the infrastructure (or better, I never thought about it properly in terms of failure scenarios) [08:55:37] <_joe_> Pchelolo: what caused the issue in the end? [08:56:04] In very short - dev cluster change-prop being too old [08:56:13] <_joe_> ouch [08:56:22] <_joe_> I feared something like that [08:56:39] <_joe_> that's why you don't share the infrastructure with -dev things [08:56:39] whatttt [08:56:41] specifically this commit not being deployed https://github.com/wikimedia/change-propagation/commit/718c7088ef460e1c33bdea4272c0e48fd851683c [08:56:56] _joe_: that's another action item for incident report [08:57:09] 10Operations, 10media-storage, 10User-fgiunchedi: Some swift filesystems reporting negative disk usage - https://phabricator.wikimedia.org/T199198 (10fgiunchedi) Yesterday I've launched an explicit run of `swift-object-auditor` on all filesystems affected. So far no corrupted files have been found, note that... [08:57:22] plus Kafka stupidity allowing it to create topics that it can not process itself [08:57:33] <_joe_> yeah [08:57:36] (03PS1) 10Jcrespo: mariadb: Allow reimage of db1087 [puppet] - 10https://gerrit.wikimedia.org/r/445367 [08:57:50] <_joe_> ok, I need to go to the post office, and things seem ok now, right? [08:57:57] yep!
[08:58:02] _joe_: all good on my side [08:58:08] I am grabbing coffee and then start looking at kafka logs [08:58:11] <_joe_> ok, ttyl [08:58:18] I'll try to finish the report before you come back [09:04:31] (03CR) 10Jcrespo: [C: 032] mariadb: Allow reimage of db1087 [puppet] - 10https://gerrit.wikimedia.org/r/445367 (owner: 10Jcrespo) [09:04:47] (03PS2) 10Ema: reload-vcl: manually set separate VCL files as warm [puppet] - 10https://gerrit.wikimedia.org/r/445357 (https://phabricator.wikimedia.org/T164609) [09:05:13] (03CR) 10Ema: [C: 032] reload-vcl: manually set separate VCL files as warm [puppet] - 10https://gerrit.wikimedia.org/r/445357 (https://phabricator.wikimedia.org/T164609) (owner: 10Ema) [09:06:50] (03PS1) 10Marostegui: db-eqiad.php: Set up s1 on read only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445369 (https://phabricator.wikimedia.org/T197069) [09:07:30] (03CR) 10Marostegui: [C: 04-2] "Wait for the failover date" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445369 (https://phabricator.wikimedia.org/T197069) (owner: 10Marostegui) [09:08:33] (03PS1) 10Jcrespo: mariadb: Depool db1087 for reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445370 [09:10:07] (03PS1) 10Marostegui: db-eqiad.php: Promote db1067 to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445371 (https://phabricator.wikimedia.org/T197069) [09:10:46] (03CR) 10Marostegui: [C: 04-2] "Wait for failover date" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445371 (https://phabricator.wikimedia.org/T197069) (owner: 10Marostegui) [09:13:41] elukey: wrote up a strawman for what happened https://wikitech.wikimedia.org/wiki/Incident_documentation/20180711-kafka-eqiad [09:14:59] (03PS2) 10Jcrespo: mariadb: Depool db1087 for reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445370 [09:15:44] PROBLEM - BGP status on cr2-ulsfo is CRITICAL: BGP CRITICAL - AS2914/IPv4: Active, AS2914/IPv6: Active [09:16:16] 10Operations, 10Cloud-Services, 
10DC-Ops, 10decommission: decom californium - https://phabricator.wikimedia.org/T189921 (10Volans) [09:17:30] 10Operations, 10Cloud-Services, 10DC-Ops, 10decommission: decom californium - https://phabricator.wikimedia.org/T189921 (10Volans) @RobH FYI I've used this host for a test-reimage and the host doesn't want to reboot into PXE, it times out and reboots into the old OS. Given that it was already removed from... [09:17:33] !log ran puppet node clean/deactivate on db2064, hardware is broken for good and caused ongoing connection failures in cumin/debdeploy (T195228) [09:17:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:17:37] T195228: db2064 crashed and totally broken - decommission it - https://phabricator.wikimedia.org/T195228 [09:17:50] Pchelolo: awesome :) [09:18:39] one question mark that I have is the following - from the file system errors on kafka, it seems that it was trying to append a -$somedigits-delete suffix to very long file name(s) [09:19:01] as Joe pointed out, this might have been me trying to delete topics, ending up in that bug [09:19:10] 10Operations, 10ops-codfw, 10DBA, 10decommission, 10Patch-For-Review: db2064 crashed and totally broken - decommission it - https://phabricator.wikimedia.org/T195228 (10MoritzMuehlenhoff) [09:19:30] but IIRC I haven't started the cleanup soon, we tried to bring kafka up several times [09:19:47] the first attempt was raising the open file max limit [09:19:59] that didn't work since we didn't set infinity but '' [09:20:21] but after that, in theory the cluster should have started [09:20:38] do you think that restabase-dev was still hammering it with topic creations?
[09:21:40] elukey: I've stopped it and disabled puppet there [09:22:01] and feel free to improve my summary [09:23:48] (03PS1) 10KartikMistry: apertium-fra-cat: Updated deps for apertium-separable [debs/contenttranslation/apertium-fra-cat] - 10https://gerrit.wikimedia.org/r/445372 (https://phabricator.wikimedia.org/T189076) [09:30:04] RECOVERY - BGP status on cr2-ulsfo is OK: BGP OK - up: 94, down: 0, shutdown: 6 [09:31:25] PROBLEM - Router interfaces on cr2-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 75, down: 1, dormant: 0, excluded: 0, unused: 0 [09:33:31] (03CR) 10Arturo Borrero Gonzalez: [C: 031] "The old setup is 22 rules. The new setup is 184 rules. Half IPv4/IPv6 in both cases." [puppet] - 10https://gerrit.wikimedia.org/r/445126 (https://phabricator.wikimedia.org/T164609) (owner: 10Ema) [09:35:38] (03CR) 10Volans: [C: 04-1] "See inline, also you could just remove the wasat records for the host (not mgmt) if you want, but it's ok also to cleanup them later on al" (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/445362 (owner: 10Muehlenhoff) [09:40:51] (03PS1) 10Alexandros Kosiaris: icinga: Bump max_concurrent_checks to 10k [puppet] - 10https://gerrit.wikimedia.org/r/445375 (https://phabricator.wikimedia.org/T199413) [09:41:02] (03CR) 10Muehlenhoff: Rename wasat to mwmaint2001 for reimage (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/445362 (owner: 10Muehlenhoff) [09:41:04] (03PS3) 10Muehlenhoff: Rename wasat to mwmaint2001 for reimage [dns] - 10https://gerrit.wikimedia.org/r/445362 [09:43:16] (03PS3) 10Marostegui: filtered_tables: Remove ar_text and ar_flags [puppet] - 10https://gerrit.wikimedia.org/r/437432 (https://phabricator.wikimedia.org/T192926) [09:43:36] (03CR) 10Volans: [C: 031] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/445362 (owner: 10Muehlenhoff) [09:43:59] (03CR) 10Marostegui: [C: 032] filtered_tables: Remove ar_text and ar_flags [puppet] - 10https://gerrit.wikimedia.org/r/437432 
(https://phabricator.wikimedia.org/T192926) (owner: 10Marostegui) [09:46:15] !log shut down ms-be1041 for hardware diagnostics - T199198 [09:46:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:46:19] T199198: Some swift filesystems reporting negative disk usage - https://phabricator.wikimedia.org/T199198 [09:49:15] (03CR) 10Muehlenhoff: [C: 032] Rename wasat to mwmaint2001 for reimage [dns] - 10https://gerrit.wikimedia.org/r/445362 (owner: 10Muehlenhoff) [09:50:34] RECOVERY - Router interfaces on cr2-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 77, down: 0, dormant: 0, excluded: 0, unused: 0 [09:50:47] (03PS3) 10Muehlenhoff: Reimage wasat with stretch and rename to mwmaint2001 [puppet] - 10https://gerrit.wikimedia.org/r/445149 (https://phabricator.wikimedia.org/T192092) [09:52:24] PROBLEM - BGP status on cr2-ulsfo is CRITICAL: BGP CRITICAL - AS2914/IPv4: Active, AS2914/IPv6: Active [09:54:34] (03CR) 10Muehlenhoff: [C: 032] Reimage wasat with stretch and rename to mwmaint2001 [puppet] - 10https://gerrit.wikimedia.org/r/445149 (https://phabricator.wikimedia.org/T192092) (owner: 10Muehlenhoff) [10:08:02] 10Operations: rename wasat to mwmaint2001 and reinstall it with stretch - https://phabricator.wikimedia.org/T193915 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jmm on sarin.codfw.wmnet for hosts: ``` wasat.codfw.wmnet ``` The log can be found in `/var/log/wmf-auto-reimage/201807121007_jmm_1801... [10:09:39] (03CR) 10Filippo Giunchedi: "ow, see for example modules/role/manifests/prometheus/ops.pp for the correct config syntax (iow file_sd_configs and files as sections)" [puppet] - 10https://gerrit.wikimedia.org/r/445251 (owner: 10Andrew Bogott) [10:12:59] 10Operations: rsyslog's in:imtcp thread stuck on old sockets - https://phabricator.wikimedia.org/T199406 (10fgiunchedi) Interesting! 
I thought it was something similar to we've experienced this problem before in {T136312} and https://github.com/rsyslog/rsyslog/issues/1728. Supposedly the latest rsyslog version f... [10:22:37] Anybody here who could review/merge my graphite puppet change? [10:22:37] https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/443370/ [10:23:31] ^ godog maybe? [10:24:59] has a disk space calculation been done for that? [10:26:01] (03CR) 10Addshore: Add monthly storage schema for graphite (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/443370 (https://phabricator.wikimedia.org/T193641) (owner: 10Jonas Kress (WMDE)) [10:26:13] also I see that we already have daily for 25y already, so the same data is already available [10:27:16] (03PS6) 10Jonas Kress (WMDE): Add monthly storage schema for graphite [puppet] - 10https://gerrit.wikimedia.org/r/443370 (https://phabricator.wikimedia.org/T193641) [10:28:10] (03CR) 10Addshore: Add monthly storage schema for graphite (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/443370 (https://phabricator.wikimedia.org/T193641) (owner: 10Jonas Kress (WMDE)) [10:29:11] Jonas_WMDE: I don't think the 30d will actually work very well at all [10:29:18] 10Operations: rename wasat to mwmaint2001 and reinstall it with stretch - https://phabricator.wikimedia.org/T193915 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['mwmaint2001.codfw.wmnet'] ``` Of which those **FAILED**: ``` ['mwmaint2001.codfw.wmnet'] ``` [10:29:44] (03CR) 10Addshore: [C: 04-1] "-1 per my second comment on PS5" [puppet] - 10https://gerrit.wikimedia.org/r/443370 (https://phabricator.wikimedia.org/T193641) (owner: 10Jonas Kress (WMDE)) [10:33:15] Jonas_WMDE: it would likely end up with odd stuff like 010118 then 310118 then 290218 etc [10:33:51] agreed, not sure what's the purpose? 
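The point made above about a 30d bucket producing odd dates can be checked with a little date arithmetic. The following is only a sketch of that arithmetic, not of how Graphite actually aligns its buckets: the start date of 2018-01-01 and the helper name `bucket_starts` are assumptions chosen for illustration.

```python
from datetime import date, timedelta


def bucket_starts(first, step_days, count):
    """Return the start dates of `count` consecutive fixed-width buckets."""
    return [first + timedelta(days=step_days * i) for i in range(count)]


# Hypothetical anchor date, chosen only to mirror the ddmmyy examples in the chat.
starts = bucket_starts(date(2018, 1, 1), 30, 5)
labels = [d.strftime("%d%m%y") for d in starts]

for label in labels:
    print(label)
```

Under these assumptions January gets two bucket starts and February gets none, so fixed 30-day buckets drift away from calendar months, which is the concern raised in the review.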
[10:36:49] (03PS1) 10Giuseppe Lavagetto: deploy-apache-change: stop asking for confirmation [puppet] - 10https://gerrit.wikimedia.org/r/445382 [10:44:07] (03CR) 10Giuseppe Lavagetto: [C: 032] deploy-apache-change: stop asking for confirmation [puppet] - 10https://gerrit.wikimedia.org/r/445382 (owner: 10Giuseppe Lavagetto) [10:46:31] 10Operations, 10Analytics, 10Analytics-EventLogging, 10EventBus, and 3 others: kafka eqiad cluster keeps crashing - https://phabricator.wikimedia.org/T199353 (10mobrovac) 05Open>03Resolved p:05Triage>03Unbreak! a:03Pchelolo Kafka and the affected services are back and operational now, resolving. [10:47:20] 10Operations, 10Analytics, 10Analytics-EventLogging, 10EventBus, and 3 others: kafka eqiad cluster keeps crashing - https://phabricator.wikimedia.org/T199353 (10mobrovac) The incident report can be found [here](https://wikitech.wikimedia.org/wiki/Incident_documentation/20180711-kafka-eqiad). [10:50:11] (03CR) 10Alexandros Kosiaris: [C: 031] Enable base::service_auto_restart for SSH [puppet] - 10https://gerrit.wikimedia.org/r/444230 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [10:54:04] 10Operations, 10Wikimedia-Logstash, 10Goal, 10User-fgiunchedi, 10User-herron: Modernize logging, alerting and metrics monitoring infrastructure - Adopt Logstash (2018-19 Q1 Goal) - https://phabricator.wikimedia.org/T198753 (10fgiunchedi) [10:54:41] 10Operations, 10Wikimedia-Logstash, 10Goal: Logstash/Kibana architecture review - https://phabricator.wikimedia.org/T198754 (10fgiunchedi) [10:55:43] 10Operations, 10Wikimedia-Logstash, 10Goal: Audit log producers across the infrastructure and plan their transition to centralized logging. 
- https://phabricator.wikimedia.org/T198756 (10fgiunchedi) [10:57:12] 10Operations, 10Wikimedia-Logstash, 10Patch-For-Review, 10User-herron: Ship host syslogs to ELK - https://phabricator.wikimedia.org/T193766 (10fgiunchedi) [10:58:02] 10Operations, 10Wikimedia-Logstash, 10Goal: Investigate log shipping methods and standardize on them (logstash) - https://phabricator.wikimedia.org/T198757 (10fgiunchedi) [10:59:25] Hello [10:59:28] chiborg: ping? [10:59:43] Dereckson hi [10:59:46] RECOVERY - BGP status on cr2-ulsfo is OK: BGP OK - up: 96, down: 0, shutdown: 4 [10:59:57] I can SWAT today. [11:00:04] chiborg: I'm not sure if it's valuable to merge to wmf10 branch [11:00:05] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for European Mid-day SWAT(Max 6 patches) . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180712T1100). [11:00:05] chiborg: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:17] zeljkof: zeljkof is going to update group2 to wmf12 after SWAT [11:00:20] chiborg: ^ [11:00:31] Dereckson yeah, you can skip wmf10 [11:00:34] ok [11:00:44] let's deploy to wmf12 so [11:01:29] Dereckson: are you doing swat today? [11:01:32] 10Operations, 10Icinga, 10monitoring, 10Patch-For-Review: Max concurrent service checks reached on tegmen - https://phabricator.wikimedia.org/T199413 (10Volans) I've cleaned the 33G file so that the host could recover, I'm looking at the other overly big log files cause we'd had at least one fallout from t... [11:01:35] zeljkof: yes [11:01:50] ok, I'm around if needed :) [11:02:51] o/ [11:03:21] chiborg: live on mwdebug1002 [11:04:30] Dereckson, do you have space for some last minute patches? 
I just realized SWAT isnt in 2 hours :D [11:04:37] Urbanecm: sure [11:04:44] Ok, going to add them into the calendar. [11:06:37] Dereckson, they're added, ping me when they'll be deployed. Thank you! [11:07:54] Dereckson checking ... [11:08:29] !log swift eqiad-prod: more weight to ms-be1036 after hw repair [11:08:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:10:17] Dereckson: I can't use the new event logging schema definition (WMDEBannerEvents) yet on mwdebug. Is it possible that the schema IDs are cached somewhere or have to propagate? [11:10:32] (03CR) 10Zhuyifei1999: [C: 031] "LGTM. Not sure if we should increase the timeout as well" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/444879 (https://phabricator.wikimedia.org/T140415) (owner: 10Nehajha) [11:11:31] chiborg: no idea [11:11:55] I checked the MD5 of extension.json, it's well the correct new version [11:12:11] Dereckson since this is "only" a change to the schema IDs in MediawikiEvent I think it's safe to roll out even when it does not work yet. [11:13:15] https://cho.wikipedia.org/w/extensions/WikimediaEvents/extension.json on mwdebug1002 shows the id [11:13:27] okay [11:14:12] syncing to prod [11:14:22] Dereckson I also don't know how the server-side event logging picks up all the valid IDs and detects when new ones are added in some extension.json [11:14:52] !log dereckson@deploy1001 Synchronized php-1.32.0-wmf.12/extensions/WikimediaEvents/extension.json: Add event logging for WMDE fundraising banners (duration: 00m 58s) [11:14:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:14:57] (03PS2) 10Dereckson: Use bewikibooks.png in wgLogo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445104 (https://phabricator.wikimedia.org/T189218) (owner: 10Urbanecm) [11:14:59] Thanks. 
I will continue to try calling mw.eventLog.logEvent('WMDEBannerEvents', {'bannerName':'justatatest','bannerAction':'banner-closed','eventRate':0.5}) in the console ... [11:15:04] 10Operations, 10Icinga, 10monitoring, 10Patch-For-Review: Max concurrent service checks reached on tegmen - https://phabricator.wikimedia.org/T199413 (10Volans) a:03Volans [11:15:29] (03CR) 10Dereckson: [C: 032] Use bewikibooks.png in wgLogo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445104 (https://phabricator.wikimedia.org/T189218) (owner: 10Urbanecm) [11:16:34] (03CR) 10Zhuyifei1999: [C: 04-1] Removing gridengine as default and encouraging the use of Kubernetes (032 comments) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/443190 (https://phabricator.wikimedia.org/T154504) (owner: 10Nehajha) [11:17:23] (03Merged) 10jenkins-bot: Use bewikibooks.png in wgLogo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445104 (https://phabricator.wikimedia.org/T189218) (owner: 10Urbanecm) [11:18:05] PROBLEM - toolschecker: check mtime mod from tools cron job on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/toolscron - 185 bytes in 0.013 second response time [11:25:20] (03CR) 10Dereckson: [C: 032] Replace spaces with underscores in bnwikisource ExtraNamespaces definition [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444749 (https://phabricator.wikimedia.org/T199161) (owner: 10Urbanecm) [11:25:27] (03PS3) 10Dereckson: Replace spaces with underscores in bnwikisource ExtraNamespaces definition [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444749 (https://phabricator.wikimedia.org/T199161) (owner: 10Urbanecm) [11:25:39] (03CR) 10Dereckson: [C: 032] Replace spaces with underscores in bnwikisource ExtraNamespaces definition [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444749 (https://phabricator.wikimedia.org/T199161) (owner: 10Urbanecm) [11:27:48] 
(03Merged) 10jenkins-bot: Replace spaces with underscores in bnwikisource ExtraNamespaces definition [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444749 (https://phabricator.wikimedia.org/T199161) (owner: 10Urbanecm) [11:28:57] Urbanecm: both live on mwdebug1002 [11:29:02] ack [11:30:25] Dereckson, they work correctly, please sync them [11:30:29] (03PS2) 1020after4: Add phabricator-antivandalism extension to the library path [puppet] - 10https://gerrit.wikimedia.org/r/445329 (https://phabricator.wikimedia.org/T162026) [11:30:52] (03PS2) 1020after4: Disable phabricator rate limits [puppet] - 10https://gerrit.wikimedia.org/r/445328 (https://phabricator.wikimedia.org/T198974) [11:31:18] Dereckson Could it be that WikimediaEvents needs a wmf/1.32.0-wmf.12 branch and only the wmf.10 branch is deployed? See https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/extensions/WikimediaEvents/ [11:31:52] Urbanecm: syncing [11:31:55] ack [11:32:24] (03PS1) 10Giuseppe Lavagetto: mcrouter_generate_certs: fix typo [puppet] - 10https://gerrit.wikimedia.org/r/445387 [11:32:46] !log dereckson@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Fix bewikibooks logo (T189218) and bnwikisource namespace (T199161) (duration: 00m 57s) [11:32:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:32:51] T199161: There's forbidden space in ExtraNamespaces definition for bnwikisource - https://phabricator.wikimedia.org/T199161 [11:32:51] T189218: Change bewikibooks logo - https://phabricator.wikimedia.org/T189218 [11:33:02] <_joe_> moritzm: ^^ [11:33:39] (03CR) 10Giuseppe Lavagetto: [C: 032] mcrouter_generate_certs: fix typo [puppet] - 10https://gerrit.wikimedia.org/r/445387 (owner: 10Giuseppe Lavagetto) [11:34:09] 10Operations, 10ChangeProp, 10Services (designing): Separate dev Change-Prop from production Kafka cluster - https://phabricator.wikimedia.org/T199427 (10Pchelolo) [11:34:47] chiborg: we've deployed to *wmf12*, so you need to test on 
a wiki from group0 or group1 @ https://tools.wmflabs.org/versions [11:34:57] ack, thanks [11:36:12] looking at https://test.wikipedia.org/wiki/Special:Version (with wmf-debug still active) shows WikimediaEvents with a version from July 5, not from today [11:37:09] Special:Version is heavily cached [11:37:21] we noticed that in a previous deployment [11:38:24] Dereckson ok. I just saw that WikimediaEvents does not have a wmf.12 tag and wondered if that could be the cause of the schema not being accepted. But it's a stretch. [11:42:51] It's more, the git cache is only updated in certain circumstances [11:43:17] chiborg: ? it does https://github.com/wikimedia/mediawiki-extensions-WikimediaEvents/tree/wmf/1.32.0-wmf.12 [11:45:10] Reedy My bad, I misread the gerrit gitiles output. Sorry. [11:45:24] Yeah, it's not the best UI :( [11:45:47] 10Operations: rename wasat to mwmaint2001 and reinstall it with stretch - https://phabricator.wikimedia.org/T193915 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jmm on sarin.codfw.wmnet for hosts: ``` mwmaint2001.codfw.wmnet ``` The log can be found in `/var/log/wmf-auto-reimage/201807121145_jm... [11:45:49] 10Operations: rename wasat to mwmaint2001 and reinstall it with stretch - https://phabricator.wikimedia.org/T193915 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['mwmaint2001.codfw.wmnet'] ``` Of which those **FAILED**: ``` ['mwmaint2001.codfw.wmnet'] ``` [11:49:56] 10Operations: rename wasat to mwmaint2001 and reinstall it with stretch - https://phabricator.wikimedia.org/T193915 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jmm on sarin.codfw.wmnet for hosts: ``` mwmaint2001.codfw.wmnet ``` The log can be found in `/var/log/wmf-auto-reimage/201807121149_jm...
[11:49:59] 10Operations: rename wasat to mwmaint2001 and reinstall it with stretch - https://phabricator.wikimedia.org/T193915 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['mwmaint2001.codfw.wmnet'] ``` Of which those **FAILED**: ``` ['mwmaint2001.codfw.wmnet'] ``` [11:52:40] (03CR) 10jenkins-bot: Use bewikibooks.png in wgLogo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445104 (https://phabricator.wikimedia.org/T189218) (owner: 10Urbanecm) [11:52:42] (03CR) 10jenkins-bot: Replace spaces with underscores in bnwikisource ExtraNamespaces definition [mediawiki-config] - 10https://gerrit.wikimedia.org/r/444749 (https://phabricator.wikimedia.org/T199161) (owner: 10Urbanecm) [11:53:15] chiborg: Which wiki are you testing it on? [11:53:21] 10Operations: rename wasat to mwmaint2001 and reinstall it with stretch - https://phabricator.wikimedia.org/T193915 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jmm on sarin.codfw.wmnet for hosts: ``` mwmaint2001.codfw.wmnet ``` The log can be found in `/var/log/wmf-auto-reimage/201807121153_jm... 
[11:54:14] 10Operations, 10ChangeProp, 10MediaWiki-JobQueue, 10Services (designing), 10Wikimedia-Incident: Consider the possibility of separating ChangeProp and JobQueue on Kafka level - https://phabricator.wikimedia.org/T199431 (10Pchelolo) [11:54:15] Reedy test.wikipedia.org and commons [11:54:28] 10Operations: rename wasat to mwmaint2001 and reinstall it with stretch - https://phabricator.wikimedia.org/T193915 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['mwmaint2001.codfw.wmnet'] ``` Of which those **FAILED**: ``` ['mwmaint2001.codfw.wmnet'] ``` [11:55:05] 10Operations, 10ChangeProp, 10Services (designing), 10Wikimedia-Incident: Separate dev Change-Prop from production Kafka cluster - https://phabricator.wikimedia.org/T199427 (10Pchelolo) [11:56:35] I wonder if https://github.com/wikimedia/mediawiki-extensions-WikimediaEvents/commit/e19065b5731c512313cda5dc75f85b29c051814e is related [11:58:07] ["WMDEBannerEvents"]=> [11:58:07] int(18193948) [11:58:07] ["WMDEBannerSizeIssue"]=> [11:58:07] int(18193993) [11:58:10] That looks right though [11:58:15] Reedy Interesting. I'd have to ask someone more knowledgeable than me how changes from extension.json are propagated to the event logging system configuration. [11:59:28] Reedy Or maybe I'm testing wrong.
Im' doing mw.eventLog.logEvent('WMDEBannerEvents', {'bannerName':'justatatest','bannerAction':'banner-closed','eventRate':0.5}) [11:59:32] on beta that works [12:00:00] Could be js related caching [12:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180712T1200) [12:00:16] Dereckson: Might be worth just pushing it live, then chiborg waiting for the js caches to expire [12:01:20] Reedy: it's live [12:01:36] ok [12:06:10] 10Operations: rename wasat to mwmaint2001 and reinstall it with stretch - https://phabricator.wikimedia.org/T193915 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jmm on sarin.codfw.wmnet for hosts: ``` mwmaint2001.codfw.wmnet ``` The log can be found in `/var/log/wmf-auto-reimage/201807121205_jm... [12:06:12] 10Operations: rename wasat to mwmaint2001 and reinstall it with stretch - https://phabricator.wikimedia.org/T193915 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['mwmaint2001.codfw.wmnet'] ``` Of which those **FAILED**: ``` ['mwmaint2001.codfw.wmnet'] ``` [12:08:05] PROBLEM - HP RAID on ms-be1036 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [12:08:15] PROBLEM - SSH on ms-be1036 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:09:46] PROBLEM - configured eth on ms-be1036 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [12:10:45] PROBLEM - swift-container-replicator on ms-be1036 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
[12:11:45] RECOVERY - swift-container-replicator on ms-be1036 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [12:11:55] RECOVERY - configured eth on ms-be1036 is OK: OK - interfaces up [12:12:26] RECOVERY - SSH on ms-be1036 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u2 (protocol 2.0) [12:17:49] aye 1036 is the rebalance [12:34:54] jouncebot: now [12:34:54] For the next 0 hour(s) and 25 minute(s): Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180712T1200) [12:35:14] o/ zeljkof i guess your still running the train? [12:39:37] (03PS4) 10Arturo Borrero Gonzalez: openstack: bootstrap: neutron: refresh and add more hints [puppet] - 10https://gerrit.wikimedia.org/r/444222 (https://phabricator.wikimedia.org/T196633) [12:48:07] (03PS1) 10DCausse: [cirrus] allow term_freq and remove deprecated settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445399 [12:48:51] 10Operations, 10monitoring: Alert on negative disk space available - https://phabricator.wikimedia.org/T199436 (10fgiunchedi) [12:52:42] (03CR) 10Arturo Borrero Gonzalez: [C: 032] openstack: bootstrap: neutron: refresh and add more hints [puppet] - 10https://gerrit.wikimedia.org/r/444222 (https://phabricator.wikimedia.org/T196633) (owner: 10Arturo Borrero Gonzalez) [12:54:49] PROBLEM - mediawiki-installation DSH group on mwmaint2001 is CRITICAL: Host mwmaint2001 is not in mediawiki-installation dsh group [12:56:14] 10Operations, 10monitoring: Alert on negative disk space available - https://phabricator.wikimedia.org/T199436 (10fgiunchedi) Upstream issue: https://github.com/monitoring-plugins/monitoring-plugins/issues/1544 [12:57:04] !log Drop unused grants from db1073 [12:57:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:58:29] (03CR) 10Cparle: [C: 031] [cirrus] allow term_freq and remove deprecated settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445399 
(owner: 10DCausse) [13:00:04] zeljkof: #bothumor I � Unicode. All rise for MediaWiki train - European version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180712T1300). [13:01:39] PROBLEM - HTTP-noc on mwmaint2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:01:49] RECOVERY - toolschecker: check mtime mod from tools cron job on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.011 second response time [13:02:59] (03PS1) 10Muehlenhoff: Switch dbtree over to mwmaint1001 [puppet] - 10https://gerrit.wikimedia.org/r/445400 [13:03:08] ^ reimage, silencing [13:04:18] addshore: sorry, was out of lunch :) yes, train-ing this and the next week [13:05:06] !log run xfs_repair /dev/sdd1 on ms-be1043 - T199198 [13:05:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:05:09] T199198: Some swift filesystems reporting negative disk usage - https://phabricator.wikimedia.org/T199198 [13:09:52] (03CR) 10Alexandros Kosiaris: [C: 032] apertium-fra-cat: Updated deps for apertium-separable [debs/contenttranslation/apertium-fra-cat] - 10https://gerrit.wikimedia.org/r/445372 (https://phabricator.wikimedia.org/T189076) (owner: 10KartikMistry) [13:11:55] (03PS1) 10Zfilipin: all wikis to 1.32.0-wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445403 [13:11:57] (03CR) 10Zfilipin: [C: 032] all wikis to 1.32.0-wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445403 (owner: 10Zfilipin) [13:13:26] (03Merged) 10jenkins-bot: all wikis to 1.32.0-wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445403 (owner: 10Zfilipin) [13:13:39] RECOVERY - HTTP-noc on mwmaint2001 is OK: HTTP OK: HTTP/1.1 200 OK - 10975 bytes in 0.073 second response time [13:13:43] (03CR) 10jenkins-bot: all wikis to 1.32.0-wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445403 (owner: 10Zfilipin) [13:14:18] !log installing openssh updates from stretch 9.4 point release [13:14:20] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:33] !log zfilipin@deploy1001 rebuilt and synchronized wikiversions files: all wikis to 1.32.0-wmf.12 [13:14:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:59] !log upload apertium-fra-cat_1.3.0~r84327-1+wmf2 to apt.wikimedia.org/jessie-wikimedia/main [13:15:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:47] I'm doing this https://wikitech.wikimedia.org/wiki/Heterogeneous_deployment/Train_deploys#Thursday:_group{0,1}_to_all_deploy [13:16:08] kart_: ^ [13:16:15] `zfilipin@deploy1001:~$ ./release/bin/deploy-promote all` [13:16:16] I 'll upgrade it on the scb cluster now [13:16:34] and the script said `Result: SUCCESS` [13:16:49] but it also said `1 hosts had sync_wikiversions errors` [13:17:05] win 10 [13:17:17] I guess this is it: `sudo -u mwdeploy -n -- /usr/bin/rsync -l deploy1001.eqiad.wmnet::common/wikiversions*.{json,php} /srv/mediawiki on wasat.codfw.wmnet returned [255]: Host key verification failed.` [13:17:39] PROBLEM - DPKG on analytics1072 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [13:17:39] PROBLEM - DPKG on analytics1075 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [13:17:49] (03PS1) 10Sau226: Generic placeholder to be updated [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445405 [13:17:51] 10Operations, 10MediaWiki-API, 10MediaWiki-General-or-Unknown: Unit of measure in Quantity datatype when written with bots (Wikidata API) is not (correctly) shown in Wikidata pages. - https://phabricator.wikimedia.org/T199438 (10Considering.Different.Routes) [13:17:56] moritzm: is there a problem with wasat.codfw.wmnet? 
see my comments above [13:17:58] PROBLEM - DPKG on analytics1073 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [13:18:10] it was being reimaged today [13:18:18] PROBLEM - DPKG on analytics1071 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [13:18:19] PROBLEM - DPKG on analytics1070 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [13:18:19] so new key [13:18:28] PROBLEM - DPKG on analytics1074 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [13:18:35] akosiaris: do I need to do anything? [13:18:38] PROBLEM - DPKG on analytics1077 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [13:18:39] PROBLEM - DPKG on analytics1076 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [13:19:21] (03Abandoned) 10Sau226: Generic placeholder to be updated [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445405 (owner: 10Sau226) [13:19:28] zeljkof: /me looking [13:19:36] looking at the hadoop hosts, that's likely the new openssh [13:19:36] (03CR) 10Vgutierrez: [C: 031] Move FSM connect state handling to the FSM itself [debs/pybal] - 10https://gerrit.wikimedia.org/r/434163 (owner: 10Mark Bergsma) [13:19:41] akosiaris: thanks! [13:20:38] zeljkof, akosiaris: sorry, forgot to drop wasat from the dsh group, it's being reimaged to stretch and renamed to mwmaint2001 for consistency [13:20:41] fixing that now [13:20:51] moritzm: ah nice, thanks [13:21:07] moritzm: ok, so I can just ignore the message, or is there anything I need to do? [13:21:25] zeljkof: you can ignore it, it'll be fixed soonish [13:21:38] moritzm: thanks! [13:21:41] the host no longer exists (or rather exists under a new name now) [13:21:58] PROBLEM - puppet last run on analytics1072 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 3 minutes ago with 2 failures. 
Failed resources (up to 3 shown): Package[openssh-server],Exec[set debconf flag seen for wireshark-common/install-setuid] [13:22:02] (03PS3) 10Sau226: Create TemplateEditor group on enwikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/441839 (https://phabricator.wikimedia.org/T198056) (owner: 10Urbanecm) [13:22:44] (03PS1) 10Muehlenhoff: Remove wasat from dsh [puppet] - 10https://gerrit.wikimedia.org/r/445406 [13:22:46] (03CR) 10Jcrespo: [C: 031] Switch dbtree over to mwmaint1001 [puppet] - 10https://gerrit.wikimedia.org/r/445400 (owner: 10Muehlenhoff) [13:23:26] (03Restored) 10Sau226: Generic placeholder to be updated [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445405 (owner: 10Sau226) [13:23:44] (03CR) 10Muehlenhoff: [C: 032] Remove wasat from dsh [puppet] - 10https://gerrit.wikimedia.org/r/445406 (owner: 10Muehlenhoff) [13:24:02] akosiaris: it was renamed to mwmaint2001 [13:24:05] to be precise ;) [13:24:38] PROBLEM - puppet last run on analytics1070 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 4 minutes ago with 2 failures. Failed resources (up to 3 shown): Package[openssh-server],Exec[set debconf flag seen for wireshark-common/install-setuid] [13:24:50] (03PS4) 10Sau226: Create TemplateEditor group on enwikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/441839 (https://phabricator.wikimedia.org/T198056) (owner: 10Urbanecm) [13:25:13] (03CR) 10Sau226: "Sorry @Urbanecm for intruding on your patch. 
I'll not do that again in future" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/441839 (https://phabricator.wikimedia.org/T198056) (owner: 10Urbanecm) [13:25:45] (03PS1) 10Muehlenhoff: Re-add mwmaint2001 to dsh group [puppet] - 10https://gerrit.wikimedia.org/r/445407 [13:26:27] (03PS1) 10Rush: dumps: point cloud vps hosts at labstore1006 [puppet] - 10https://gerrit.wikimedia.org/r/445408 (https://phabricator.wikimedia.org/T198420) [13:27:07] (03PS1) 10Arturo Borrero Gonzalez: openstack: neutron metadata agent: add support for jessie/mitaka [puppet] - 10https://gerrit.wikimedia.org/r/445409 (https://phabricator.wikimedia.org/T196633) [13:27:21] (03CR) 10Urbanecm: "@sau226 Well, it doesn't matter :). Will assign this permissions to admins and schedule this for deployment." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/441839 (https://phabricator.wikimedia.org/T198056) (owner: 10Urbanecm) [13:27:31] (03CR) 10Urbanecm: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/441839 (https://phabricator.wikimedia.org/T198056) (owner: 10Urbanecm) [13:27:48] RECOVERY - DPKG on analytics1073 is OK: All packages OK [13:28:09] RECOVERY - DPKG on analytics1071 is OK: All packages OK [13:28:09] RECOVERY - DPKG on analytics1070 is OK: All packages OK [13:28:19] RECOVERY - DPKG on analytics1074 is OK: All packages OK [13:28:29] PROBLEM - puppet last run on analytics1071 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 3 minutes ago with 2 failures. Failed resources (up to 3 shown): Package[openssh-server],Exec[set debconf flag seen for wireshark-common/install-setuid] [13:28:48] PROBLEM - puppet last run on analytics1073 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 2 minutes ago with 2 failures. 
Failed resources (up to 3 shown): Package[openssh-server],Exec[set debconf flag seen for wireshark-common/install-setuid] [13:28:57] (03PS2) 10Sau226: Allow privileged users to add and remove templateeditor right on enwikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445405 (https://phabricator.wikimedia.org/T198056) [13:29:38] RECOVERY - DPKG on analytics1077 is OK: All packages OK [13:29:39] RECOVERY - DPKG on analytics1076 is OK: All packages OK [13:31:24] (03PS5) 10Urbanecm: Create TemplateEditor group on enwikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/441839 (https://phabricator.wikimedia.org/T198056) [13:31:51] (03PS6) 10Urbanecm: Create TemplateEditor group on enwikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/441839 (https://phabricator.wikimedia.org/T198056) [13:33:47] (03CR) 10Sau226: "I already added the flags in the patch I made. Feel free to comment/edit if needed" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/441839 (https://phabricator.wikimedia.org/T198056) (owner: 10Urbanecm) [13:38:44] (03CR) 10Urbanecm: [C: 04-1] "This is and should be handled in 441839. Please abandon this patch." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445405 (https://phabricator.wikimedia.org/T198056) (owner: 10Sau226) [13:41:30] (03CR) 10Vgutierrez: "I'm missing a test that ensures that passiveStart is properly handled in BGPPeering.__init__. Right now it could be hardcoded to False and" (031 comment) [debs/pybal] - 10https://gerrit.wikimedia.org/r/436297 (owner: 10Mark Bergsma) [13:42:17] (03Abandoned) 10Sau226: Allow privileged users to add and remove templateeditor right on enwikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445405 (https://phabricator.wikimedia.org/T198056) (owner: 10Sau226) [13:42:19] (03CR) 10Urbanecm: "No need for 2 patches, in this case." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/441839 (https://phabricator.wikimedia.org/T198056) (owner: 10Urbanecm) [13:42:22] 10Operations, 10Analytics, 10EventBus, 10Services (watching), 10Wikimedia-Incident: Document the process for hard-deleting topics in kafka - https://phabricator.wikimedia.org/T199441 (10Pchelolo) [13:43:15] 10Operations, 10Analytics, 10EventBus, 10Services (watching), and 2 others: Document the process for hard-deleting topics in kafka - https://phabricator.wikimedia.org/T199441 (10elukey) p:05Triage>03Normal [13:43:41] (03CR) 10Sau226: [C: 031] "Just make sure to add the sysop + bureaucrat flags. From code working standpoint looks ok" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/441839 (https://phabricator.wikimedia.org/T198056) (owner: 10Urbanecm) [13:44:18] PROBLEM - Check systemd state on ms-be1036 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [13:45:58] (03PS7) 10Sau226: Create TemplateEditor group on enwikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/441839 (https://phabricator.wikimedia.org/T198056) (owner: 10Urbanecm) [13:47:18] PROBLEM - puppet last run on analytics1075 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 2 minutes ago with 2 failures. Failed resources (up to 3 shown): Package[openssh-server],Exec[set debconf flag seen for wireshark-common/install-setuid] [13:47:45] 10Operations, 10Discovery-Search (Current work): migrate elasticsearch to stretch (from jessie) - https://phabricator.wikimedia.org/T193649 (10Gehel) SSL Cert issue was only that SSL Certs needed to be rehashed (`sudo update-ca-certificates -f`). I can't believe it took me so long to find that. So `deployment-... [13:48:00] 10Operations, 10CommRel-Internals, 10Wikimedia-Mailing-lists: Rename (create anew) the TC team mailing list - https://phabricator.wikimedia.org/T155683 (10Qgil) 05stalled>03Open OK, in more detail. 
We want to close https://lists.wikimedia.org/mailman/listinfo/cep and keep the archive for now (maybe in th... [13:49:53] (03CR) 10Sau226: "All flags are set. Waiting for Martin to handle the rest" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/441839 (https://phabricator.wikimedia.org/T198056) (owner: 10Urbanecm) [13:55:08] RECOVERY - puppet last run on analytics1070 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [13:58:20] (03PS8) 10Urbanecm: Create TemplateEditor group on enwikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/441839 (https://phabricator.wikimedia.org/T198056) [13:59:01] RECOVERY - puppet last run on analytics1071 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [13:59:20] RECOVERY - puppet last run on analytics1073 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [13:59:31] (03CR) 10Urbanecm: "There's no need to assign sysop-level permissions to bureacurats. 
For example, bureaucrat doesn't have the right to delete pages, as well " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/441839 (https://phabricator.wikimedia.org/T198056) (owner: 10Urbanecm) [14:07:42] !log Drop unused grants from db2068 [14:07:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:08] (03PS1) 10Fdans: Suppress false alerts in sqoop mediawiki tables [puppet] - 10https://gerrit.wikimedia.org/r/445412 (https://phabricator.wikimedia.org/T198966) [14:10:57] (03CR) 10jerkins-bot: [V: 04-1] Suppress false alerts in sqoop mediawiki tables [puppet] - 10https://gerrit.wikimedia.org/r/445412 (https://phabricator.wikimedia.org/T198966) (owner: 10Fdans) [14:11:14] (03CR) 10Muehlenhoff: [C: 032] Re-add mwmaint2001 to dsh group [puppet] - 10https://gerrit.wikimedia.org/r/445407 (owner: 10Muehlenhoff) [14:11:16] (03CR) 10Fdans: [C: 04-1] "messed up a conflict, sorry" [puppet] - 10https://gerrit.wikimedia.org/r/445412 (https://phabricator.wikimedia.org/T198966) (owner: 10Fdans) [14:11:35] (03PS4) 10Giuseppe Lavagetto: mediawiki: split all of remnant.conf into individual vhosts [puppet] - 10https://gerrit.wikimedia.org/r/444187 [14:12:24] jouncebot: now [14:12:24] For the next 0 hour(s) and 47 minute(s): MediaWiki train - European version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180712T1300) [14:12:58] zeljkof: everything went according to plan? [14:13:00] (03PS2) 10Fdans: Suppress false alerts in sqoop mediawiki tables [puppet] - 10https://gerrit.wikimedia.org/r/445412 (https://phabricator.wikimedia.org/T198966) [14:13:59] jouncebot: next [14:14:00] In 1 hour(s) and 46 minute(s): Puppet SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180712T1600) [14:14:07] addshore: yes, we won the game yesterday ;) [14:14:11] elukey: this is the approach you were suggesting right? 
I don't see a way to suppress those messages specifically [14:14:17] haha :P [14:14:23] elukey: (patch above) [14:14:31] zeljkof: in that case, mind if I backport this fix for a regression now? https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/Wikibase/+/445413/ [14:14:41] <_joe_> zeljkof: strange, I didn't see you on the pitch :D [14:15:03] (03PS3) 10Elukey: Suppress false alerts in sqoop mediawiki tables [puppet] - 10https://gerrit.wikimedia.org/r/445412 (https://phabricator.wikimedia.org/T198966) (owner: 10Fdans) [14:15:05] _joe_: I was on the bench :) [14:15:28] fdans: yep! [14:15:43] addshore: go ahead as far as I am concerned [14:15:53] (03CR) 10Elukey: [C: 032] Suppress false alerts in sqoop mediawiki tables [puppet] - 10https://gerrit.wikimedia.org/r/445412 (https://phabricator.wikimedia.org/T198966) (owner: 10Fdans) [14:15:58] zeljkof: thanks! [14:16:01] awyisss [14:16:02] jakob_WMDE: doing it now :) [14:16:15] (03PS5) 10Giuseppe Lavagetto: mediawiki: split all of remnant.conf into individual vhosts [puppet] - 10https://gerrit.wikimedia.org/r/444187 [14:16:19] addshore: awesome, thanks! [14:17:27] elukey: so I'm guessing now we remove the current crons and run puppet? [14:21:13] (03CR) 10Giuseppe Lavagetto: [C: 032] mediawiki: split all of remnant.conf into individual vhosts [puppet] - 10https://gerrit.wikimedia.org/r/444187 (owner: 10Giuseppe Lavagetto) [14:26:12] (03PS1) 10Giuseppe Lavagetto: mediawiki_test: fix typo [puppet] - 10https://gerrit.wikimedia.org/r/445415 [14:26:38] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] mediawiki_test: fix typo [puppet] - 10https://gerrit.wikimedia.org/r/445415 (owner: 10Giuseppe Lavagetto) [14:27:03] fdans: the cron gets updated when puppet runs, all good :) [14:27:10] PROBLEM - puppet last run on mwdebug1001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [14:27:24] elukey: hell yea, what a time to be alive ;) [14:28:11] (03PS2) 10Elukey: role::kafka::main: raise Kafka Java Xmx/Xms [puppet] - 10https://gerrit.wikimedia.org/r/445304 [14:30:30] (03PS1) 10Giuseppe Lavagetto: mediawiki_test: brown paper back fix [puppet] - 10https://gerrit.wikimedia.org/r/445416 [14:30:44] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] mediawiki_test: brown paper back fix [puppet] - 10https://gerrit.wikimedia.org/r/445416 (owner: 10Giuseppe Lavagetto) [14:30:51] (03PS2) 10Muehlenhoff: Switch dbtree over to mwmaint1001 [puppet] - 10https://gerrit.wikimedia.org/r/445400 [14:31:58] jakob_WMDE: it is on mwdebug1002 [14:32:07] (03CR) 10Elukey: "Pcc: https://puppet-compiler.wmflabs.org/compiler02/11785/" [puppet] - 10https://gerrit.wikimedia.org/r/445304 (owner: 10Elukey) [14:32:53] (03CR) 10Muehlenhoff: [C: 032] Switch dbtree over to mwmaint1001 [puppet] - 10https://gerrit.wikimedia.org/r/445400 (owner: 10Muehlenhoff) [14:33:03] (03PS3) 10Muehlenhoff: Switch dbtree over to mwmaint1001 [puppet] - 10https://gerrit.wikimedia.org/r/445400 [14:34:21] and jakob_WMDE Q4105711 looks good after a purge :) [14:34:26] addshore: yay works! [14:34:40] yup, also just checked after some initial confusion due to caching :D [14:34:45] addshore: thanks! 
[14:34:57] will sync now [14:37:20] RECOVERY - puppet last run on mwdebug1001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [14:37:39] !log addshore@deploy1001 Synchronized php-1.32.0-wmf.12/extensions/Wikibase/repo: T199379 Fix missing label for unit items when displayed on entity page [[gerrit:445413]] (duration: 01m 02s) [14:37:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:42] T199379: Qid displayed instead of unit name - https://phabricator.wikimedia.org/T199379 [14:37:43] * addshore expects a small collection of logs due to the nature of the patch [14:38:06] jakob_WMDE: should be all done [14:38:14] zeljkof: thanks for letting me steal the end of the slot [14:38:25] addshore: thanks! [14:38:44] addshore: it's the least I could do after what happened yesterday ;) [14:38:51] PROBLEM - HHVM jobrunner on mw1304 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time [14:38:53] hahahaa :P [14:39:03] :D [14:40:00] RECOVERY - HHVM jobrunner on mw1304 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.003 second response time [14:44:11] (03CR) 10Paladox: "ok." [puppet] - 10https://gerrit.wikimedia.org/r/445329 (https://phabricator.wikimedia.org/T162026) (owner: 1020after4) [14:44:11] zeljkof: I just got on the computer, how's it going today? :) [14:44:30] greg-g: smooth sailing so far captain! 
:) [14:44:34] * zeljkof salutes [14:44:54] (as far as I can tell, but nobody is screaming at me yet) [14:45:39] zeljkof: very awesome :) [14:50:54] (03PS1) 10Muehlenhoff: Reflect new name of wasat in smokeping config [puppet] - 10https://gerrit.wikimedia.org/r/445418 [14:55:01] RECOVERY - mediawiki-installation DSH group on mwmaint2001 is OK: OK [14:59:07] (03Abandoned) 10Muehlenhoff: tcpircbot: remove terbium from ferm rules [puppet] - 10https://gerrit.wikimedia.org/r/430530 (https://phabricator.wikimedia.org/T192092) (owner: 10Dzahn) [15:01:13] 10Operations: rename wasat to mwmaint2001 and reinstall it with stretch - https://phabricator.wikimedia.org/T193915 (10ayounsi) Let me (or DCops) know if/when we can rename the switch port description. [15:01:41] (03PS2) 10Arturo Borrero Gonzalez: openstack: neutron metadata agent: add support for jessie/mitaka [puppet] - 10https://gerrit.wikimedia.org/r/445409 (https://phabricator.wikimedia.org/T196633) [15:01:51] (03CR) 10Ayounsi: [C: 031] Reflect new name of wasat in smokeping config [puppet] - 10https://gerrit.wikimedia.org/r/445418 (owner: 10Muehlenhoff) [15:02:15] (03CR) 10Mobrovac: "I wounder if it would be better to have specific params for the class, like heap_size and then turn them into Java opts. 
Allowing the pass" [puppet] - 10https://gerrit.wikimedia.org/r/445304 (owner: 10Elukey) [15:02:35] (03CR) 10Arturo Borrero Gonzalez: [C: 032] openstack: neutron metadata agent: add support for jessie/mitaka [puppet] - 10https://gerrit.wikimedia.org/r/445409 (https://phabricator.wikimedia.org/T196633) (owner: 10Arturo Borrero Gonzalez) [15:04:16] (03PS1) 10Muehlenhoff: Update grants for terbium->mwmaint1001 migration and wasat rename [puppet] - 10https://gerrit.wikimedia.org/r/445421 (https://phabricator.wikimedia.org/T192092) [15:05:41] (03PS1) 10Arturo Borrero Gonzalez: hieradata: openstack: eqiad1: add cloudvirt1022 to the pool [puppet] - 10https://gerrit.wikimedia.org/r/445422 (https://phabricator.wikimedia.org/T196633) [15:06:31] (03CR) 10Arturo Borrero Gonzalez: [C: 032] hieradata: openstack: eqiad1: add cloudvirt1022 to the pool [puppet] - 10https://gerrit.wikimedia.org/r/445422 (https://phabricator.wikimedia.org/T196633) (owner: 10Arturo Borrero Gonzalez) [15:06:34] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Wikidata-Query-Service-Sprint: Define an SLO for Wikidata Query Service public endpoint and communicate it - https://phabricator.wikimedia.org/T199228 (10Gehel) You can have a look at the historical values we have for * update lag: https://gr... [15:07:52] did the group2 wikis deploy go out? 
[15:07:56] !log cloudvirt1021:~# /sbin/reboot [15:07:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:50] (03PS1) 10Muehlenhoff: Decomission terbium [puppet] - 10https://gerrit.wikimedia.org/r/445423 [15:13:26] (03PS2) 10Muehlenhoff: Decommission terbium [puppet] - 10https://gerrit.wikimedia.org/r/445423 [15:14:20] (03CR) 10jerkins-bot: [V: 04-1] Decommission terbium [puppet] - 10https://gerrit.wikimedia.org/r/445423 (owner: 10Muehlenhoff) [15:14:43] (03CR) 10Volans: "nitpick inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/445421 (https://phabricator.wikimedia.org/T192092) (owner: 10Muehlenhoff) [15:17:27] (03PS2) 10Bstorm: dumps: point cloud vps hosts at labstore1006 [puppet] - 10https://gerrit.wikimedia.org/r/445408 (https://phabricator.wikimedia.org/T198420) (owner: 10Rush) [15:20:30] (03CR) 10Muehlenhoff: Update grants for terbium->mwmaint1001 migration and wasat rename (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/445421 (https://phabricator.wikimedia.org/T192092) (owner: 10Muehlenhoff) [15:20:36] (03PS2) 10Muehlenhoff: Update grants for terbium->mwmaint1001 migration and wasat rename [puppet] - 10https://gerrit.wikimedia.org/r/445421 (https://phabricator.wikimedia.org/T192092) [15:26:48] (03CR) 10Vgutierrez: [C: 031] Move NaiveBGPPeeringTestCase to test_peering [debs/pybal] - 10https://gerrit.wikimedia.org/r/436766 (owner: 10Mark Bergsma) [15:26:54] 10Operations, 10ops-codfw, 10DBA, 10decommission, 10Patch-For-Review: db2064 crashed and totally broken - decommission it - https://phabricator.wikimedia.org/T195228 (10RobH) [15:27:33] (03CR) 10Bstorm: [C: 032] dumps: point cloud vps hosts at labstore1006 [puppet] - 10https://gerrit.wikimedia.org/r/445408 (https://phabricator.wikimedia.org/T198420) (owner: 10Rush) [15:31:13] 10Operations: rename wasat to mwmaint2001 and reinstall it with stretch - https://phabricator.wikimedia.org/T193915 (10MoritzMuehlenhoff) >>! 
In T193915#4420024, @ayounsi wrote: > Let me (or DCops) know if/when we can rename the switch port description. Ack, I'll make a separate ticket for that. [15:31:26] (03PS1) 10RobH: decom db2064, remove prod dns [dns] - 10https://gerrit.wikimedia.org/r/445428 (https://phabricator.wikimedia.org/T195228) [15:31:50] (03CR) 10RobH: [C: 032] decom db2064, remove prod dns [dns] - 10https://gerrit.wikimedia.org/r/445428 (https://phabricator.wikimedia.org/T195228) (owner: 10RobH) [15:32:47] 10Operations, 10ops-codfw, 10DBA, 10decommission: db2064 crashed and totally broken - decommission it - https://phabricator.wikimedia.org/T195228 (10RobH) [15:33:01] 10Operations, 10ops-codfw, 10DBA, 10decommission: db2064 crashed and totally broken - decommission it - https://phabricator.wikimedia.org/T195228 (10RobH) a:05RobH>03Papaul [15:39:01] hey ops folks, looks like there's a major-ish npm-related virus in progress [15:39:15] infected eslint-related packages, which means it's potentially running on jenkins [15:39:47] https://github.com/eslint/eslint-scope/issues/39 and just in the past 10 minutes https://github.com/eslint/eslint/issues/10600 [15:39:50] so maybe more [15:39:51] cscott: We know [15:39:57] Reedy: ok, just checking. 
[15:39:58] https://lists.wikimedia.org/pipermail/wikitech-l/2018-July/090331.html [15:39:59] :) [15:40:16] i read twitter more than email it seems [15:40:27] heh [15:42:31] 10Operations, 10SRE-Access-Requests: +2 for Addshore on operations/puppet - https://phabricator.wikimedia.org/T199325 (10Jonas) 05Open>03declined [15:46:13] !log progressively pushing new (tighter) mgmt firewall policies [15:46:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:47:39] cscott: CI doesn't have any ~/.npmrc, so it doesn't appear to have anything to compromise [15:48:08] and theoretically it should be relatively contained filesystem wise by docker [15:48:14] legoktm: well, the compromise script came from a pastebin, so it could have been running anything, not just .npmrc harvesting [15:48:23] right [15:48:49] some travis CI scripts have npm credentials so they can publish a package once it is pushed to a release branch [15:49:00] i don't know if our travis-using folks have anything like that set up [15:49:01] !log downgrade apertium-fra-cat to apertium-fra-cat_1.2.0~r78602-1+wmf2 [15:49:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:49:16] !log rolling restart apertium-apy on scb nodes [15:49:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:52:27] (03PS3) 10Jcrespo: mariadb: Depool db1087 for reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445370 [15:54:39] (03CR) 10Jcrespo: [C: 032] mariadb: Depool db1087 for reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445370 (owner: 10Jcrespo) [15:56:21] (03Merged) 10jenkins-bot: mariadb: Depool db1087 for reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445370 (owner: 10Jcrespo) [15:57:55] (03CR) 10jenkins-bot: mariadb: Depool db1087 for reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445370 (owner: 10Jcrespo) [15:59:56] !log jynus@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool 
db1087 (duration: 00m 50s) [15:59:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:00:04] godog, moritzm, and _joe_: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Puppet SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180712T1600). [16:00:04] No GERRIT patches in the queue for this window AFAICS. [16:01:25] (03PS1) 10Andrew Bogott: nova.conf: remove RetryFilter from list of scheduler filters [puppet] - 10https://gerrit.wikimedia.org/r/445431 [16:02:23] 10Operations, 10Icinga, 10monitoring, 10Patch-For-Review: Max concurrent service checks reached on tegmen - https://phabricator.wikimedia.org/T199413 (10Volans) Shortly after a puppet run, this was logged: ``` Jul 11 09:31:12 tegmen systemd[1]: Starting Update the etcd last modified index for MediaWiki con... [16:04:31] (03CR) 10Andrew Bogott: [C: 032] nova.conf: remove RetryFilter from list of scheduler filters [puppet] - 10https://gerrit.wikimedia.org/r/445431 (owner: 10Andrew Bogott) [16:05:47] (03CR) 10Rush: [C: 031] "sure" [puppet] - 10https://gerrit.wikimedia.org/r/445431 (owner: 10Andrew Bogott) [16:15:25] (03PS1) 10Arturo Borrero Gonzalez: cloudvps: eqiad1: extend mysql limits [puppet] - 10https://gerrit.wikimedia.org/r/445432 (https://phabricator.wikimedia.org/T196633) [16:22:26] 10Operations, 10Icinga, 10monitoring, 10Patch-For-Review: Max concurrent service checks reached on tegmen - https://phabricator.wikimedia.org/T199413 (10Volans) I've saved a part of the 30GB daemon log into `daemon.20180711.T199413.log`. I'll follow up with changes for the systemd unit/timer. 
[16:22:35] (03PS1) 10Jcrespo: mariadb: Add prometheus monitoring to labcontrol1003 [puppet] - 10https://gerrit.wikimedia.org/r/445433 [16:23:23] (03CR) 10jerkins-bot: [V: 04-1] mariadb: Add prometheus monitoring to labcontrol1003 [puppet] - 10https://gerrit.wikimedia.org/r/445433 (owner: 10Jcrespo) [16:23:38] (03CR) 10Smalyshev: [C: 031] [cirrus] allow term_freq and remove deprecated settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/445399 (owner: 10DCausse) [16:27:18] (03PS2) 10Jcrespo: mariadb: Add prometheus monitoring to labcontrol1003 [puppet] - 10https://gerrit.wikimedia.org/r/445433 [16:37:23] (03PS2) 10Muehlenhoff: Reflect new name of wasat in smokeping config [puppet] - 10https://gerrit.wikimedia.org/r/445418 [16:37:41] uh [16:37:46] someone just changed the topic [16:38:31] (03CR) 10Muehlenhoff: [C: 032] Reflect new name of wasat in smokeping config [puppet] - 10https://gerrit.wikimedia.org/r/445418 (owner: 10Muehlenhoff) [16:45:09] !log restarting smokeping on netmon1002 to pick up new config after wasat rename [16:45:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:54:46] 10Operations, 10Tracking: Hardware Automation Workflow - Overall Tracking - https://phabricator.wikimedia.org/T116063 (10ayounsi) [16:55:43] (03PS1) 10Muehlenhoff: Switch noc backend for codfw from wasat to mwmaint2001 [puppet] - 10https://gerrit.wikimedia.org/r/445438 [16:56:04] !log labcontrol1003:~# neutron net-update 7425e328-560c-4f00-8e99-706f3fb90bb4 --port_security_enabled=true [16:56:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:57:05] (03CR) 10Elukey: "> I wounder if it would be better to have specific params for the" [puppet] - 10https://gerrit.wikimedia.org/r/445304 (owner: 10Elukey) [17:00:05] cscott, arlolra, subbu, halfak, and Amir1: It is that lovely time of the day again! You are hereby commanded to deploy Services – Graphoid / Parsoid / Citoid / ORES. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180712T1700). [17:01:24] (03CR) 10Rush: [C: 032] cloudvps: eqiad1: extend mysql limits [puppet] - 10https://gerrit.wikimedia.org/r/445432 (https://phabricator.wikimedia.org/T196633) (owner: 10Arturo Borrero Gonzalez) [17:01:32] (03PS2) 10Rush: cloudvps: eqiad1: extend mysql limits [puppet] - 10https://gerrit.wikimedia.org/r/445432 (https://phabricator.wikimedia.org/T196633) (owner: 10Arturo Borrero Gonzalez) [17:13:16] (03PS1) 10Elukey: Revert "Suppress false alerts in sqoop mediawiki tables" [puppet] - 10https://gerrit.wikimedia.org/r/445439 [17:13:23] (03PS2) 10Elukey: Revert "Suppress false alerts in sqoop mediawiki tables" [puppet] - 10https://gerrit.wikimedia.org/r/445439 [17:13:47] (03CR) 10Elukey: [V: 032 C: 032] Revert "Suppress false alerts in sqoop mediawiki tables" [puppet] - 10https://gerrit.wikimedia.org/r/445439 (owner: 10Elukey) [17:18:51] 10Operations, 10Cloud-Services, 10Wikimedia-Mailing-lists: Find a better way to notify tool maintainers of schema and API changes - https://phabricator.wikimedia.org/T199234 (10MusikAnimal) [17:19:11] PROBLEM - Check systemd state on ms-be1036 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [17:24:41] PROBLEM - Check systemd state on ms-be1036 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [17:34:29] 10Operations, 10Analytics, 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: Import some Analytics git puppet submodules to operations/puppet - https://phabricator.wikimedia.org/T188377 (10Nuria) 05Open>03Resolved [17:34:41] PROBLEM - Check systemd state on ms-be1036 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[17:34:42] Operations, Wikidata, Wikidata-Query-Service, Discovery-Wikidata-Query-Service-Sprint: Define an SLO for Wikidata Query Service public endpoint and communicate it - https://phabricator.wikimedia.org/T199228 (Smalyshev) I think response times and number of timeouts are not a good metric for this t...
[17:55:49] Operations, Wikidata, Wikidata-Query-Service, Discovery-Wikidata-Query-Service-Sprint: Define an SLO for Wikidata Query Service public endpoint and communicate it - https://phabricator.wikimedia.org/T199228 (Gehel) > I think response times and number of timeouts are not a good metric for this typ...
[18:00:05] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: That opportune time is upon us again. Time for a Morning SWAT (Max 6 patches) deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180712T1800).
[18:00:05] No GERRIT patches in the queue for this window AFAICS.
[18:03:35] Operations, Wikidata, Wikidata-Query-Service, Discovery-Wikidata-Query-Service-Sprint: Define an SLO for Wikidata Query Service public endpoint and communicate it - https://phabricator.wikimedia.org/T199228 (Smalyshev) > ( ASK{ ?x ?y ?z };) does timeout from time to time. This is definitely the...
[18:04:12] Operations, Wikidata, Wikidata-Query-Service, Discovery-Wikidata-Query-Service-Sprint: Define an SLO for Wikidata Query Service public endpoint and communicate it - https://phabricator.wikimedia.org/T199228 (Smalyshev) p:Triage>Normal
[18:06:16] But I wanna swat!
[18:06:48] SWAT ALL THE THINGS
[18:07:21] I should start learning about running the train. :P
[18:07:32] thats when the real fun begins
[18:07:49] Niharika: we...
could make that happen :)
[18:08:20] Niharika: also, you could pull a Reedy and just go through the mw-config backlog and see what you want to review/merge/deploy :)
[18:08:30] :D
[18:08:34] I think Timo has been doing that recently
[18:08:39] oooh
[18:08:42] i might have some stuff
[18:08:53] greg-g: Oh, is that his secret? I'll be happy to pitch in! :D
[18:09:22] Niharika: if you want stuff to deploy I can provide you with stuff :P
[18:09:23] "reedy spam" as it is affectionally called
[18:09:35] *ately
[18:09:36] addshore: Only important shit! :P
[18:09:52] its very important :p
[18:09:57] (PS4) Addshore: Wikidata dispatch, disable dispatching for testwikidatawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/430924
[18:10:13] (PS8) Addshore: Wikidata dispatch, Use a LockManager with short TTL for testwikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/395967 (https://phabricator.wikimedia.org/T178652)
[18:10:17] ^That looks important.
[18:10:24] (PS4) Addshore: Revert "Wikidata dispatch, disable dispatching for testwikidatawiki" [mediawiki-config] - https://gerrit.wikimedia.org/r/430925
[18:10:25] https://gerrit.wikimedia.org/r/q/project:operations%252Fmediawiki-config+status:open
[18:10:34] * Niharika bookmarks
[18:10:57] Someone was looking for TemplateStyles deploying somewhere
[18:11:22] (Abandoned) Niharika29: Deploy GlobalPrefs to all production wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/420948 (owner: Niharika29)
[18:11:50] Niharika: those 3 patches could be amazzzzing :D
[18:12:47] addshore: Shall I deploy?
[18:13:01] yup! lets start with the first one (turning off dispatching for testwikidata)
[18:13:05] put 'em in the calendar if you do (I guess adam should)
[18:13:12] I can ! :)
[18:14:18] (CR) Niharika29: [C: 2] "SWAT."
[mediawiki-config] - https://gerrit.wikimedia.org/r/430924 (owner: Addshore)
[18:15:23] added :)
[18:15:57] (Merged) jenkins-bot: Wikidata dispatch, disable dispatching for testwikidatawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/430924 (owner: Addshore)
[18:16:16] addshore: No testing, right?
[18:16:26] well, we can make sure it doesnt make the site explode :)
[18:16:37] addshore: Okay, it's on mwdebug1002.
[18:16:40] * addshore always loads things on mwdebug1002 even if im expecting nothing to break
[18:16:41] *checks*
[18:16:57] looks good to me
[18:17:39] Syncing...
[18:18:07] (CR) jenkins-bot: Wikidata dispatch, disable dispatching for testwikidatawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/430924 (owner: Addshore)
[18:18:17] addshore: best to load on mwdebug *especailly* when you think it won't break anything
[18:18:24] :)
[18:18:24] !log niharika29@deploy1001 Synchronized wmf-config/Wikibase.php: Disable dispatching for testwikidatawiki https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/430924/ (duration: 00m 50s)
[18:18:27] Good point.
[18:18:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:18:30] addshore: ^^
[18:18:34] addshore: Next up is https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/395967/?
[18:18:40] greg-g: indeed :P I mean, I have loaded stuff on there and tested it and still broken stuff
[18:18:46] Niharika: yes
[18:18:52] but first we need to tail tail -f /var/log/wikidata/dispatchChanges-testwikidatawiki.log on mwmaint1001
[18:19:03] and wait for the script to finish running
[18:19:03] (CR) Niharika29: [C: 2] "SWAT." [mediawiki-config] - https://gerrit.wikimedia.org/r/395967 (https://phabricator.wikimedia.org/T178652) (owner: Addshore)
[18:19:12] which should be in 10 mins D:
[18:19:21] but merging it is fine for now, but no syncing :)
[18:19:26] addshore: You running that or should I?
[18:19:28] we can do the mwdebug1002 testing while we wait
[18:19:32] Alright, no syncing.
[18:19:38] * addshore is watching the log file :) i'll tell you when we can sync
[18:20:13] Gotcha. addshore - to be sure, I can pull it on mwdebug1002 after it merges, right?
[18:20:19] yup
[18:20:27] 👍
[18:20:44] (Merged) jenkins-bot: Wikidata dispatch, Use a LockManager with short TTL for testwikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/395967 (https://phabricator.wikimedia.org/T178652) (owner: Addshore)
[18:21:06] this evil thing only actually runs on mwmaint1001, but we can make sure the sites are all still alive while we wait the next mins :)
[18:21:46] addshore: Right. I pulled it on mwdebug1002.
[18:22:00] *checks*
[18:22:28] everything looks alive and well
[18:22:32] * addshore waits for this log file
[18:22:43] (CR) jenkins-bot: Wikidata dispatch, Use a LockManager with short TTL for testwikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/395967 (https://phabricator.wikimedia.org/T178652) (owner: Addshore)
[18:23:30] (CR) Smalyshev: "@ArielGlenn the issues should be fixed now, could you check again?" [puppet] - https://gerrit.wikimedia.org/r/378355 (https://phabricator.wikimedia.org/T198356) (owner: Smalyshev)
[18:23:40] (PS11) Smalyshev: Generate daily diffs for categories RDF [puppet] - https://gerrit.wikimedia.org/r/378355 (https://phabricator.wikimedia.org/T198356)
[18:23:45] still 7 mins to wait Niharika :(
[18:23:57] addshore: No worries! Lemme know when it's done.
[18:24:05] that certainly is the one lame things about the run time of the script
[18:24:15] well, one of many lame things about the script... but ....
[18:24:30] Operations, Cloud-Services, netops: Allocate public v4 IPs for Neutron setup in eqiad - https://phabricator.wikimedia.org/T193496 (chasemp) ```+ /* Cloud public prefix via labnet100[45] */ + route 185.15.56.0/25 next-hop 10.64.22.4;```
[18:24:40] PROBLEM - Disk space on maps1001 is CRITICAL: DISK CRITICAL - free space: /srv 54393 MB (3% inode=99%)
[18:26:47] Operations, Cloud-Services, netops: Allocate public v4 IPs for Neutron setup in eqiad - https://phabricator.wikimedia.org/T193496 (chasemp) >>! In T193496#4420713, @chasemp wrote: > ```+ /* Cloud public prefix via labnet100[45] */ > + route 185.15.56.0/25 next-hop 10.64.22.4;``` @ayounsi if ht...
[18:28:21] 2 mins ...
[18:28:55] I might think of a better thing to do than waiting for prod.... I guess actually killing the script would be fine, as it will be using a new lock manager when the script restarts.... so there wouldn't be any locks....
[18:30:10] Niharika: all ready!
[18:30:13] sync away!
[18:30:18] Alrighty!
[18:30:57] (CR) Daniel Kinzler: Wikidata dispatch, Use a LockManager with short TTL for testwikidata (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/395967 (https://phabricator.wikimedia.org/T178652) (owner: Addshore)
[18:31:24] (CR) Addshore: Wikidata dispatch, Use a LockManager with short TTL for testwikidata (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/395967 (https://phabricator.wikimedia.org/T178652) (owner: Addshore)
[18:32:05] !log niharika29@deploy1001 Synchronized wmf-config/: Wikidata dispatch, Use a LockManager with short TTL for testwikidata T178652 (duration: 00m 51s)
[18:32:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:32:09] T178652: Wikidata dispatchers should use a LockManager with a short TTL - https://phabricator.wikimedia.org/T178652
[18:32:13] (CR) Niharika29: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/430925 (owner: Addshore)
[18:32:26] addshore: Just this one now, right?^
[18:32:31] yup
[18:33:00] (PS1) Addshore: Fix typo in docs for wikibase dispatch lock manager [mediawiki-config] - https://gerrit.wikimedia.org/r/445448
[18:33:03] Niharika: ^^ and that one fixing a typo
[18:33:12] they can both go out together if you like
[18:33:21] Sure!
[18:33:39] and, i mean, if you have time and want to we could do real wikidata too :P
[18:33:52] (Merged) jenkins-bot: Revert "Wikidata dispatch, disable dispatching for testwikidatawiki" [mediawiki-config] - https://gerrit.wikimedia.org/r/430925 (owner: Addshore)
[18:34:04] but naaah, best save that for another day
[18:34:20] addshore: As you wish.
:)
[18:34:43] (CR) Niharika29: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/445448 (owner: Addshore)
[18:36:47] (PS2) Niharika29: Fix typo in docs for wikibase dispatch lock manager [mediawiki-config] - https://gerrit.wikimedia.org/r/445448 (owner: Addshore)
[18:36:58] (CR) Niharika29: [C: 2] Fix typo in docs for wikibase dispatch lock manager [mediawiki-config] - https://gerrit.wikimedia.org/r/445448 (owner: Addshore)
[18:38:13] (Merged) jenkins-bot: Fix typo in docs for wikibase dispatch lock manager [mediawiki-config] - https://gerrit.wikimedia.org/r/445448 (owner: Addshore)
[18:38:15] (CR) jenkins-bot: Revert "Wikidata dispatch, disable dispatching for testwikidatawiki" [mediawiki-config] - https://gerrit.wikimedia.org/r/430925 (owner: Addshore)
[18:38:22] addshore: Back on mwdebug1002 to test.
[18:38:50] RECOVERY - Disk space on maps1001 is OK: DISK OK
[18:38:55] Niharika: looks alive
[18:38:56] go for it
[18:40:30] !log niharika29@deploy1001 Synchronized wmf-config/Wikibase.php: Revert - Wikidata dispatch, disable dispatching for testwikidatawiki https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/430925/ (duration: 00m 49s)
[18:40:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:41:21] thanks!
[18:41:31] and now I wait 4 more mins and that log should start filling up again
[18:42:24] (CR) jenkins-bot: Fix typo in docs for wikibase dispatch lock manager [mediawiki-config] - https://gerrit.wikimedia.org/r/445448 (owner: Addshore)
[18:45:11] Niharika: and it is all running again perfectly :)
[18:46:11] Woohoo!
[19:00:04] Deploy window MediaWiki train - Americas version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180712T1900)
[19:19:03] Operations, MediaWiki-API, MediaWiki-General-or-Unknown: Unit of measure in Quantity datatype when written with bots (Wikidata API) is not (correctly) shown in Wikidata pages.
- https://phabricator.wikimedia.org/T199438 (matej_suchanek)
[19:19:19] Puppet, Cloud-Services, Toolforge, Goal: Fully puppetize Grid Engine - https://phabricator.wikimedia.org/T88711 (Bstorm) T199276#4420812 is one thing needed for this.
[19:22:58] (PS1) Rush: openstack: profile::openstack::eqiad1::neutron::dmz_cidr [puppet] - https://gerrit.wikimedia.org/r/445452 (https://phabricator.wikimedia.org/T167357)
[19:45:07] (PS1) Smalyshev: Enable fetching constraints for Updater [puppet] - https://gerrit.wikimedia.org/r/445454 (https://phabricator.wikimedia.org/T192567)
[19:45:56] (CR) jerkins-bot: [V: -1] Enable fetching constraints for Updater [puppet] - https://gerrit.wikimedia.org/r/445454 (https://phabricator.wikimedia.org/T192567) (owner: Smalyshev)
[19:47:14] (PS2) Smalyshev: Enable fetching constraints for Updater [puppet] - https://gerrit.wikimedia.org/r/445454 (https://phabricator.wikimedia.org/T192567)
[20:04:31] (CR) Rush: [C: 2] openstack: profile::openstack::eqiad1::neutron::dmz_cidr [puppet] - https://gerrit.wikimedia.org/r/445452 (https://phabricator.wikimedia.org/T167357) (owner: Rush)
[20:07:34] (PS1) Rush: openstack: eqiad1 neutron bootstrap DNS server [puppet] - https://gerrit.wikimedia.org/r/445457 (https://phabricator.wikimedia.org/T196633)
[20:08:21] (CR) Rush: [C: 2] openstack: eqiad1 neutron bootstrap DNS server [puppet] - https://gerrit.wikimedia.org/r/445457 (https://phabricator.wikimedia.org/T196633) (owner: Rush)
[20:09:06] (CR) Andrew Bogott: [C: 1] openstack: eqiad1 neutron bootstrap DNS server [puppet] - https://gerrit.wikimedia.org/r/445457 (https://phabricator.wikimedia.org/T196633) (owner: Rush)
[20:24:45] Operations, Research, SRE-Access-Requests, Patch-For-Review: Request access to data for citation usage research - https://phabricator.wikimedia.org/T198662 (Pirroh) @MoritzMuehlenhoff: I've accessed one of the analytics servers today (under the
guidance of @Miriam) and everything seems to work pe...
[20:27:31] Operations, fundraising-tech-ops, netops: New PFW policy for Amazon - https://phabricator.wikimedia.org/T199341 (cwdent) Resolved>Open @ayounsi I reverted this since it wasn't the problem after all, please deploy 1531426993 at your leisure
[20:35:33] (PS1) Andrew Bogott: mwopenstackclients.py: allow specifying 'region' to the nova client [puppet] - https://gerrit.wikimedia.org/r/445463
[20:37:32] (PS1) RobH: decom db1059 [dns] - https://gerrit.wikimedia.org/r/445465 (https://phabricator.wikimedia.org/T196606)
[20:39:06] (PS1) RobH: decom db1059 [puppet] - https://gerrit.wikimedia.org/r/445527 (https://phabricator.wikimedia.org/T196606)
[20:39:16] (CR) RobH: [C: 2] decom db1059 [dns] - https://gerrit.wikimedia.org/r/445465 (https://phabricator.wikimedia.org/T196606) (owner: RobH)
[20:39:53] (CR) RobH: [C: 2] decom db1059 [puppet] - https://gerrit.wikimedia.org/r/445527 (https://phabricator.wikimedia.org/T196606) (owner: RobH)
[20:54:08] (PS2) Andrew Bogott: mwopenstackclients.py: allow specifying 'region' to the nova client [puppet] - https://gerrit.wikimedia.org/r/445463
[20:55:45] (CR) Krinkle: [C: -1] Do not leak local $wgWBShared… variables to th eglobal scope [mediawiki-config] - https://gerrit.wikimedia.org/r/444632 (owner: Thiemo Kreuz (WMDE))
[20:58:18] (CR) Krinkle: [C: 1] "per current master MW / (MediaWiki Codesearch), setting ExternalDiffEngine = 'wikidiff2' just results in getEngine flipping it back to fal" [mediawiki-config] - https://gerrit.wikimedia.org/r/445128 (owner: Gergő Tisza)
[21:12:04] !log advertising 185.15.56.0/24 to the DFZ - T193496
[21:12:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:12:08] T193496: Allocate public v4 IPs for Neutron setup in eqiad - https://phabricator.wikimedia.org/T193496
[21:16:02] (PS1) Framawiki: Test spaces in ExtraNamespaces
[mediawiki-config] - https://gerrit.wikimedia.org/r/445537 (https://phabricator.wikimedia.org/T199162)
[21:16:29] ^^ reviews are welcome :)
[21:17:19] (CR) jerkins-bot: [V: -1] Test spaces in ExtraNamespaces [mediawiki-config] - https://gerrit.wikimedia.org/r/445537 (https://phabricator.wikimedia.org/T199162) (owner: Framawiki)
[21:21:04] (PS2) Framawiki: Test spaces in ExtraNamespaces [mediawiki-config] - https://gerrit.wikimedia.org/r/445537 (https://phabricator.wikimedia.org/T199162)
[21:22:44] (CR) jerkins-bot: [V: -1] Test spaces in ExtraNamespaces [mediawiki-config] - https://gerrit.wikimedia.org/r/445537 (https://phabricator.wikimedia.org/T199162) (owner: Framawiki)
[21:27:53] (PS3) Framawiki: Test spaces in ExtraNamespaces [mediawiki-config] - https://gerrit.wikimedia.org/r/445537 (https://phabricator.wikimedia.org/T199162)
[21:29:31] (CR) jerkins-bot: [V: -1] Test spaces in ExtraNamespaces [mediawiki-config] - https://gerrit.wikimedia.org/r/445537 (https://phabricator.wikimedia.org/T199162) (owner: Framawiki)
[21:32:25] (PS4) Framawiki: Test spaces in ExtraNamespaces [mediawiki-config] - https://gerrit.wikimedia.org/r/445537 (https://phabricator.wikimedia.org/T199162)
[21:33:25] Operations, fundraising-tech-ops, netops: New PFW policy for Amazon - https://phabricator.wikimedia.org/T199341 (ayounsi) 1531426993 Pushed.
[21:33:38] Operations, fundraising-tech-ops, netops: New PFW policy for Amazon - https://phabricator.wikimedia.org/T199341 (ayounsi) Open>Resolved
[21:34:30] (CR) jerkins-bot: [V: -1] Test spaces in ExtraNamespaces [mediawiki-config] - https://gerrit.wikimedia.org/r/445537 (https://phabricator.wikimedia.org/T199162) (owner: Framawiki)
[21:42:53] (CR) Framawiki: "Patch is ready for review." [mediawiki-config] - https://gerrit.wikimedia.org/r/445537 (https://phabricator.wikimedia.org/T199162) (owner: Framawiki)
[21:47:25] (CR) Urbanecm: "No, it isn't.
See https://phabricator.wikimedia.org/T199480. I'll fix it and add the patch as the dependency of this one, so jenkins will " [mediawiki-config] - https://gerrit.wikimedia.org/r/445537 (https://phabricator.wikimedia.org/T199162) (owner: Framawiki)
[21:49:53] (PS1) Urbanecm: Fix unexpected space in pswiki's ExtraNamespaces definition [mediawiki-config] - https://gerrit.wikimedia.org/r/445541 (https://phabricator.wikimedia.org/T199480)
[21:51:47] (PS5) Urbanecm: Test spaces in ExtraNamespaces [mediawiki-config] - https://gerrit.wikimedia.org/r/445537 (https://phabricator.wikimedia.org/T199162) (owner: Framawiki)
[21:53:02] (CR) Framawiki: Fix unexpected space in pswiki's ExtraNamespaces definition (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/445541 (https://phabricator.wikimedia.org/T199480) (owner: Urbanecm)
[21:53:39] (CR) jerkins-bot: [V: -1] Test spaces in ExtraNamespaces [mediawiki-config] - https://gerrit.wikimedia.org/r/445537 (https://phabricator.wikimedia.org/T199162) (owner: Framawiki)
[21:55:25] (PS2) Urbanecm: Fix unexpected space in pswiki's ExtraNamespaces definition [mediawiki-config] - https://gerrit.wikimedia.org/r/445541 (https://phabricator.wikimedia.org/T199480)
[21:56:33] (PS6) Framawiki: Test spaces in ExtraNamespaces [mediawiki-config] - https://gerrit.wikimedia.org/r/445537 (https://phabricator.wikimedia.org/T199162)
[21:56:42] (CR) Framawiki: [C: 1] Fix unexpected space in pswiki's ExtraNamespaces definition [mediawiki-config] - https://gerrit.wikimedia.org/r/445541 (https://phabricator.wikimedia.org/T199480) (owner: Urbanecm)
[21:57:32] (PS7) Urbanecm: Test spaces in ExtraNamespaces [mediawiki-config] - https://gerrit.wikimedia.org/r/445537 (https://phabricator.wikimedia.org/T199162) (owner: Framawiki)
[22:03:27] (PS8) Framawiki: Test spaces in ExtraNamespaces [mediawiki-config] - https://gerrit.wikimedia.org/r/445537
(https://phabricator.wikimedia.org/T199162)
[22:04:10] PROBLEM - kubelet operational latencies on kubestage1002 is CRITICAL: instance=kubestage1002.eqiad.wmnet operation_type=stop_container https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[22:05:19] RECOVERY - kubelet operational latencies on kubestage1002 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[22:19:28] Operations, ops-ulsfo, Traffic, netops: troubleshoot cr3/cr4 link - https://phabricator.wikimedia.org/T196030 (ayounsi) From JTAC: > I have to track with engineering is PR1372815. Currently it is not viewable publicly as still work in progress. They are also still trying to source Fiberstore opt...
[22:24:59] PROBLEM - Check systemd state on ms-be1036 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[22:29:29] PROBLEM - Check systemd state on ms-be1036 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[22:32:50] PROBLEM - SSH on ms-be1036 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[22:33:49] RECOVERY - SSH on ms-be1036 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u2 (protocol 2.0)
[22:33:50] PROBLEM - Check systemd state on ms-be1036 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[22:42:48] Operations, Release Pipeline, Release-Engineering-Team (Kanban): Helm test failing for CI namespace - https://phabricator.wikimedia.org/T199489 (thcipriani)
[22:45:41] Operations, Release Pipeline, Release-Engineering-Team (Kanban): Helm test failing for CI namespace - https://phabricator.wikimedia.org/T199489 (thcipriani)
[22:55:00] (PS1) Bstorm: wiki replicas: hide abuse_filter_logs that are deleted [puppet] - https://gerrit.wikimedia.org/r/445549 (https://phabricator.wikimedia.org/T190564)
[22:58:20] (CR) Bstorm: [C: 2] wiki replicas: hide abuse_filter_logs that are deleted [puppet] - https://gerrit.wikimedia.org/r/445549 (https://phabricator.wikimedia.org/T190564) (owner: Bstorm)
[23:00:05] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: My dear minions, it's time we take the moon! Just kidding. Time for Evening SWAT (Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180712T2300).
[23:00:05] No GERRIT patches in the queue for this window AFAICS.
[23:07:40] PROBLEM - Check systemd state on ms-be1036 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[23:09:47] OK if I SWAT something given there's nothing happening?
[23:10:57] I think that'd be fine
[23:12:52] Cool.
[23:13:08] (PS2) Jforrester: Timeless is enabled everywhere, remove config [mediawiki-config] - https://gerrit.wikimedia.org/r/444776 (owner: Reedy)
[23:13:26] (CR) Jforrester: [C: 2] Timeless is enabled everywhere, remove config [mediawiki-config] - https://gerrit.wikimedia.org/r/444776 (owner: Reedy)
[23:14:45] (Merged) jenkins-bot: Timeless is enabled everywhere, remove config [mediawiki-config] - https://gerrit.wikimedia.org/r/444776 (owner: Reedy)
[23:17:59] (CR) jenkins-bot: Timeless is enabled everywhere, remove config [mediawiki-config] - https://gerrit.wikimedia.org/r/444776 (owner: Reedy)
[23:20:34] !log jforrester@deploy1001 Synchronized wmf-config/CommonSettings.php: SWAT Timeless config simplification, part I Ifc02818b6e (duration: 00m 50s)
[23:20:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:22:17] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT Timeless config simplification, part II Ifc02818b6e (duration: 00m 49s)
[23:22:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:32:14] !log jforrester@deploy1001 Synchronized php-1.32.0-wmf.12/extensions/VisualEditor: SWAT VisualEditor: Add missing i18n keys to manifest T198064 (duration: 00m 52s)
[23:32:24] (PS1) Nuria: [WIP] Changing dimensions to be read as numbers [puppet] - https://gerrit.wikimedia.org/r/445553 (https://phabricator.wikimedia.org/T167494)
[23:32:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:32:39] T198064: visualeditor-clearbutton-tooltip shouldn't be used in link inspector to remove a link - https://phabricator.wikimedia.org/T198064
[23:33:00] (CR) jerkins-bot: [V: -1] [WIP] Changing dimensions to be read as numbers [puppet] - https://gerrit.wikimedia.org/r/445553 (https://phabricator.wikimedia.org/T167494) (owner: Nuria)
[23:33:41] All right, I'm done.
[23:38:47] How do I find the trace of a fatal in prod?
[23:41:30] PROBLEM - SSH on ms-be1036 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:42:02] you can dig around in the discover section of logstash and usually find it. I usually grep for it on mwlog1001 in /srv/mw-log
[23:42:29] RECOVERY - SSH on ms-be1036 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u2 (protocol 2.0)
[23:42:40] PROBLEM - Check systemd state on ms-be1036 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[23:49:30] thcipriani: Thanks. In the end I gave up. :-)
[23:49:46] I've tried that approach from time to time
[23:50:20] PROBLEM - SSH on ms-be1036 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:50:26] Maybe someone will be able to do something with https://phabricator.wikimedia.org/T199492 anyway.
[23:52:29] RECOVERY - SSH on ms-be1036 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u2 (protocol 2.0)
[23:53:43] (PS6) BryanDavis: Allow PuppetDB use on standalone puppetmasters [puppet] - https://gerrit.wikimedia.org/r/435631 (https://phabricator.wikimedia.org/T153577) (owner: Alex Monk)
[23:53:49] PROBLEM - Check systemd state on ms-be1036 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[23:55:55] Puppet, Toolforge, Patch-For-Review: Make standalone puppetmasters optionally use PuppetDB - https://phabricator.wikimedia.org/T153577 (bd808)
[23:55:57] Puppet, Beta-Cluster-Infrastructure, Patch-For-Review: Set up puppet exported resources to collect ssh host keys for beta - https://phabricator.wikimedia.org/T72792 (bd808)
[23:56:00] Operations, Traffic, Patch-For-Review: Create and deploy a centralized letsencrypt service - https://phabricator.wikimedia.org/T194962 (bd808)
[23:56:27] Puppet, Toolforge, Patch-For-Review: Make standalone puppetmasters optionally use PuppetDB - https://phabricator.wikimedia.org/T153577 (bd808) a:scfc>None
[23:58:30] Puppet, Toolforge, Patch-For-Review: Make standalone puppetmasters optionally use PuppetDB - https://phabricator.wikimedia.org/T153577 (bd808) a:Krenair Assigning to @Krenair as he has patches in gerrit that have made this work for the deployment-prep project that I believe are ready to be merged...