[01:25:09] PROBLEM - puppet last run on ms-be1019 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[01:49:56] (CR) Krinkle: "Schedule for tomorrow's SWAT https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170314T2300" [mediawiki-config] - https://gerrit.wikimedia.org/r/342148 (https://phabricator.wikimedia.org/T158721) (owner: Krinkle)
[01:50:29] PROBLEM - puppet last run on analytics1029 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[01:53:09] RECOVERY - puppet last run on ms-be1019 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures
[02:06:39] PROBLEM - puppet last run on rhodium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[02:18:29] RECOVERY - puppet last run on analytics1029 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures
[02:31:24] !log l10nupdate@tin scap sync-l10n completed (1.29.0-wmf.15) (duration: 12m 36s)
[02:31:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:33:40] PROBLEM - puppet last run on ganeti1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[02:36:39] RECOVERY - puppet last run on rhodium is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures
[02:52:39] PROBLEM - puppet last run on db1076 is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues
[03:02:39] RECOVERY - puppet last run on ganeti1001 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures
[03:20:39] RECOVERY - puppet last run on db1076 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures
[03:22:29] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 671.93 seconds
[03:26:29] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 233.07 seconds
[03:30:39] PROBLEM - puppet last run on mw1282 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[03:45:39] PROBLEM - puppet last run on analytics1057 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[03:46:40] (CR) BryanDavis: [C: 1] Use custom LogstashFormatter [mediawiki-config] - https://gerrit.wikimedia.org/r/323492 (https://phabricator.wikimedia.org/T145133) (owner: Gergő Tisza)
[03:59:39] RECOVERY - puppet last run on mw1282 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures
[04:08:09] PROBLEM - puppet last run on es1011 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[04:13:39] RECOVERY - puppet last run on analytics1057 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures
[04:32:39] PROBLEM - puppet last run on hafnium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[04:37:09] RECOVERY - puppet last run on es1011 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures
[04:48:39] PROBLEM - puppet last run on db1009 is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues
[05:00:40] RECOVERY - puppet last run on hafnium is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures
[05:17:40] RECOVERY - puppet last run on db1009 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures
[05:19:29] PROBLEM - check mtime mod from tools cron job on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/toolscron - 185 bytes in 0.010 second response time
[05:27:29] RECOVERY - check mtime mod from tools cron job on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.011 second response time
[05:39:29] PROBLEM - check mtime mod from tools cron job on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/toolscron - 185 bytes in 0.006 second response time
[05:43:29] RECOVERY - check mtime mod from tools cron job on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.005 second response time
[05:55:29] PROBLEM - Disk space on graphite1001 is CRITICAL: DISK CRITICAL - free space: / 1753 MB (3% inode=97%)
[06:00:29] PROBLEM - Disk space on graphite1001 is CRITICAL: DISK CRITICAL - free space: / 1706 MB (3% inode=97%)
[06:02:29] PROBLEM - Disk space on graphite1001 is CRITICAL: DISK CRITICAL - free space: / 1773 MB (3% inode=97%)
[06:09:09] PROBLEM - puppet last run on cp3004 is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues
[06:15:29] PROBLEM - check mtime mod from tools cron job on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/toolscron - 185 bytes in 0.005 second response time
[06:26:29] RECOVERY - Disk space on graphite1001 is OK: DISK OK
[06:33:29] RECOVERY - check mtime mod from tools cron job on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.013 second response time
[06:38:09] RECOVERY - puppet last run on cp3004 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures
[06:45:29] PROBLEM - check mtime mod from tools cron job on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/toolscron - 185 bytes in 0.014 second response time
[06:55:29] RECOVERY - check mtime mod from tools cron job on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.010 second response time
[06:56:31] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1083" [mediawiki-config] - https://gerrit.wikimedia.org/r/342579
[06:56:37] (PS2) Marostegui: Revert "db-eqiad.php: Depool db1083" [mediawiki-config] - https://gerrit.wikimedia.org/r/342579
[06:59:16] (CR) Marostegui: [C: 2] Revert "db-eqiad.php: Depool db1083" [mediawiki-config] - https://gerrit.wikimedia.org/r/342579 (owner: Marostegui)
[07:00:41] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1083" [mediawiki-config] - https://gerrit.wikimedia.org/r/342579 (owner: Marostegui)
[07:00:49] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1083" [mediawiki-config] - https://gerrit.wikimedia.org/r/342579 (owner: Marostegui)
[07:01:36] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1083 - T132416 (duration: 00m 41s)
[07:01:44] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:01:44] T132416: Rampant differences in indexes on enwiki.revision across the DB cluster - https://phabricator.wikimedia.org/T132416
[07:02:23] (PS5) Mbch331: Remove exception on Other Projects sidebar for Dutch Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/341195 (https://phabricator.wikimedia.org/T159634)
[07:02:41] (PS1) Marostegui: db-eqiad.php: Depool db1080 [mediawiki-config] - https://gerrit.wikimedia.org/r/342580 (https://phabricator.wikimedia.org/T132416)
[07:04:12] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1080 [mediawiki-config] - https://gerrit.wikimedia.org/r/342580 (https://phabricator.wikimedia.org/T132416) (owner: Marostegui)
[07:05:22] (Merged) jenkins-bot: db-eqiad.php: Depool db1080 [mediawiki-config] - https://gerrit.wikimedia.org/r/342580 (https://phabricator.wikimedia.org/T132416) (owner: Marostegui)
[07:06:59] (CR) jenkins-bot: db-eqiad.php: Depool db1080 [mediawiki-config] - https://gerrit.wikimedia.org/r/342580 (https://phabricator.wikimedia.org/T132416) (owner: Marostegui)
[07:07:03] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1080 - T132416 (duration: 00m 41s)
[07:07:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:07:08] T132416: Rampant differences in indexes on enwiki.revision across the DB cluster - https://phabricator.wikimedia.org/T132416
[07:07:29] PROBLEM - check mtime mod from tools cron job on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/toolscron - 185 bytes in 0.013 second response time
[07:07:48] !log Deploy alter table enwiki.revision db1080 - T132416
[07:07:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:08:35] (CR) Marostegui: [C: 1] mariadb: clear build related files [puppet] -
https://gerrit.wikimedia.org/r/342506 (owner: Hashar)
[07:10:29] RECOVERY - check mtime mod from tools cron job on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.011 second response time
[07:17:46] Operations, Labs: labtestcontrol2001: cron-spam from invoke-rc.d atop _cron - https://phabricator.wikimedia.org/T159532#3097164 (elukey)
[07:17:54] Operations: Cronspam from mwlog* - https://phabricator.wikimedia.org/T156151#3097165 (elukey)
[07:20:12] (PS1) Giuseppe Lavagetto: puppetmaster: remove templates/ templatedir [puppet] - https://gerrit.wikimedia.org/r/342582
[07:20:16] <_joe_> akosiaris: ^^
[07:21:59] PROBLEM - puppet last run on mw1165 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[07:22:29] PROBLEM - check mtime mod from tools cron job on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/toolscron - 185 bytes in 0.006 second response time
[07:22:58] Operations, Puppet, Labs, Traffic, Technical-Debt: Convert all of our site.pp/roles to the role/profile paradigm - https://phabricator.wikimedia.org/T159412#3097172 (Joe) @Ciencia_Al_Poder care to explain why did you remove the "easy" tag? In general, I'd like to see a comment explaining act...
[07:23:57] (CR) Urbanecm: [C: 1] "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/341195 (https://phabricator.wikimedia.org/T159634) (owner: Mbch331)
[07:26:44] !log installing python-imaging/pillow security updates on trusty (jessie already fixed)
[07:26:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:36:14] (CR) Muehlenhoff: microsites: convert to profile/role-structure (1 comment) [puppet] - https://gerrit.wikimedia.org/r/342164 (owner: Dzahn)
[07:49:59] RECOVERY - puppet last run on mw1165 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures
[07:57:29] RECOVERY - check mtime mod from tools cron job on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.005 second response time
[08:03:59] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[08:05:39] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0]
[08:07:44] !log installing icoutils security update on trusty (jessie already fixed)
[08:07:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:07:53] peak of 503s, seems already recovered https://grafana.wikimedia.org/dashboard/db/varnish-http-errors?panelId=7&fullscreen&from=now-3h&to=now
[08:09:29] PROBLEM - check mtime mod from tools cron job on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/toolscron - 185 bytes in 0.004 second response time
[08:11:59] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[08:13:39] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[08:15:21] !log installing icu security updates on trusty (jessie already fixed)
[08:15:26] Logged the
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:19:30] Operations, ops-eqiad: Reset db1070 idrac - https://phabricator.wikimedia.org/T160392#3097215 (Marostegui)
[08:19:53] Operations, ops-eqiad: Reset db1070 idrac - https://phabricator.wikimedia.org/T160392#3097231 (Marostegui)
[08:19:57] Operations, DBA, Patch-For-Review: Install, configure and provision recently arrived db core machines - https://phabricator.wikimedia.org/T133398#3097232 (Marostegui)
[08:21:29] RECOVERY - check mtime mod from tools cron job on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.017 second response time
[08:29:05] Operations, ops-eqiad, DBA: Degraded RAID on db1070 - https://phabricator.wikimedia.org/T158969#3097248 (Marostegui)
[08:29:08] Operations, ops-eqiad: Reset db1070 idrac - https://phabricator.wikimedia.org/T160392#3097247 (Marostegui)
[08:31:08] (PS1) Elukey: Increase the squid's logrotate log retetion to 2 [puppet] - https://gerrit.wikimedia.org/r/342586 (https://phabricator.wikimedia.org/T153940)
[08:33:29] PROBLEM - check mtime mod from tools cron job on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/toolscron - 185 bytes in 0.007 second response time
[08:36:08] (PS1) Gehel: osm - correct waterline import logrotate configuration [puppet] - https://gerrit.wikimedia.org/r/342587
[08:38:09] !log moved some log files from /var/log/upstart/$logname.log.1 to /var/log/upstart/$logname.log.1.bis on labvirt1014, labtestvirt2001, labtestnet2001, labnet1001 to reduce cronspam
[08:38:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:38:19] andrewbogott, madhuvishy --^
[08:38:45] I haven't investigated what happened but I didn't want to delete those files
[08:39:59] gehel: you are awesome, thanks
[08:40:10] time to fix 10 minutes :D
[08:40:55] elukey: I'm also
the one who pushed a crappy config!
[08:41:35] (CR) Elukey: [C: 2] osm - correct waterline import logrotate configuration [puppet] - https://gerrit.wikimedia.org/r/342587 (owner: Gehel)
[08:42:29] elukey: thanks for the merge!
[08:42:29] RECOVERY - check mtime mod from tools cron job on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.011 second response time
[09:00:04] addshore: Dear anthropoid, the time has come. Please deploy InterwikiSorting (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170314T0900).
[09:00:09] o/
[09:05:43] (CR) Addshore: wmgUseInterwikiSorting true for wikidata clients, excluding wikipedias [mediawiki-config] - https://gerrit.wikimedia.org/r/341033 (https://phabricator.wikimedia.org/T150183) (owner: Addshore)
[09:05:50] (PS5) Addshore: wmgUseInterwikiSorting true for wikidata clients, excluding wikipedias [mediawiki-config] - https://gerrit.wikimedia.org/r/341033 (https://phabricator.wikimedia.org/T150183)
[09:10:11] (CR) Addshore: [C: 2] wmgUseInterwikiSorting true for wikidata clients, excluding wikipedias [mediawiki-config] - https://gerrit.wikimedia.org/r/341033 (https://phabricator.wikimedia.org/T150183) (owner: Addshore)
[09:12:08] (Merged) jenkins-bot: wmgUseInterwikiSorting true for wikidata clients, excluding wikipedias [mediawiki-config] - https://gerrit.wikimedia.org/r/341033 (https://phabricator.wikimedia.org/T150183) (owner: Addshore)
[09:12:17] (CR) jenkins-bot: wmgUseInterwikiSorting true for wikidata clients, excluding wikipedias [mediawiki-config] - https://gerrit.wikimedia.org/r/341033 (https://phabricator.wikimedia.org/T150183) (owner: Addshore)
[09:15:50] !log addshore@tin Synchronized dblists/interwikisorting.dblist: [[gerrit:341033|wmgUseInterwikiSorting true for wikidata clients, excluding wikipedias]] T150183 (duration: 00m 42s)
[09:15:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:15:57]
T150183: Deploy InterwikiSorting extension to production - https://phabricator.wikimedia.org/T150183
[09:16:29] PROBLEM - check mtime mod from tools cron job on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/toolscron - 185 bytes in 0.007 second response time
[09:19:29] RECOVERY - check mtime mod from tools cron job on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.008 second response time
[09:26:49] PROBLEM - puppet last run on db1015 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:27:15] (CR) Jcrespo: "Why not just merge 337837 , which has been waiting for weeks?" [puppet] - https://gerrit.wikimedia.org/r/342582 (owner: Giuseppe Lavagetto)
[09:30:43] (PS2) Giuseppe Lavagetto: Add redis switching task, some more stages boilerplate [switchdc] - https://gerrit.wikimedia.org/r/342498
[09:30:54] (PS1) Filippo Giunchedi: puppetmaster: bump client_max_body_size to 30m [puppet] - https://gerrit.wikimedia.org/r/342591
[09:31:20] <_joe_> jynus: d'oh you're right
[09:31:29] PROBLEM - check mtime mod from tools cron job on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/toolscron - 185 bytes in 0.021 second response time
[09:31:54] (Abandoned) Giuseppe Lavagetto: puppetmaster: remove templates/ templatedir [puppet] - https://gerrit.wikimedia.org/r/342582 (owner: Giuseppe Lavagetto)
[09:32:19] (CR) Giuseppe Lavagetto: [C: 2] puppet: Remove templatedir setting [puppet] - https://gerrit.wikimedia.org/r/337837 (https://phabricator.wikimedia.org/T95158) (owner: Jcrespo)
[09:32:30] <_joe_> jynus: I'll merge it
[09:32:36] akosiaris: heh the production catalog is indeed 20m for einsteinium, https://gerrit.wikimedia.org/r/#/c/342591/
[09:33:32] <_joe_> godog: ?
[09:34:06] <_joe_> wow
[09:34:43] _joe_: yesterday I merged a patch to add codfw PDUs and it broke puppet on icinga machines, then it was too late/tired to investigate properly
[09:35:29] RECOVERY - check mtime mod from tools cron job on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.005 second response time
[09:36:41] (PS1) Elukey: Add the Jmxtrans configuration for the MapReduce History server [puppet/cdh] - https://gerrit.wikimedia.org/r/342592 (https://phabricator.wikimedia.org/T156272)
[09:40:26] (CR) Filippo Giunchedi: "For reference, the top 20 biggest catalogs as of today" [puppet] - https://gerrit.wikimedia.org/r/342591 (owner: Filippo Giunchedi)
[09:43:49] PROBLEM - puppet last run on labtestweb2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:44:26] (CR) Filippo Giunchedi: [C: 1] Increase the squid's logrotate log retetion to 2 [puppet] - https://gerrit.wikimedia.org/r/342586 (https://phabricator.wikimedia.org/T153940) (owner: Elukey)
[09:46:17] (CR) Alexandros Kosiaris: [C: 1] "Looks fine to me..
and shouldn't even require an nginx restart to apply so we should not witness any errors either" [puppet] - https://gerrit.wikimedia.org/r/342591 (owner: Filippo Giunchedi)
[09:47:16] (CR) Alexandros Kosiaris: [C: 1] puppet: Remove templatedir setting [puppet] - https://gerrit.wikimedia.org/r/337837 (https://phabricator.wikimedia.org/T95158) (owner: Jcrespo)
[09:49:17] (CR) Filippo Giunchedi: [C: 2] puppetmaster: bump client_max_body_size to 30m [puppet] - https://gerrit.wikimedia.org/r/342591 (owner: Filippo Giunchedi)
[09:51:30] mhh nginx got restarted by puppet, there might be errors
[09:52:00] (PS2) Elukey: Increase the squid's logrotate log retention to 2 [puppet] - https://gerrit.wikimedia.org/r/342586 (https://phabricator.wikimedia.org/T153940)
[09:52:34] (CR) Elukey: "Only fixed the typo in the commit msg :)" [puppet] - https://gerrit.wikimedia.org/r/342586 (https://phabricator.wikimedia.org/T153940) (owner: Elukey)
[09:52:49] PROBLEM - puppet last run on elastic1051 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:53:15] * godog grabs umbrella
[09:53:19] PROBLEM - puppet last run on cp3046 is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues
[09:55:49] RECOVERY - puppet last run on db1015 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures
[10:00:20] (PS3) Elukey: Increase the squid's logrotate log retention to 2 [puppet] - https://gerrit.wikimedia.org/r/342586 (https://phabricator.wikimedia.org/T153940)
[10:00:55] (CR) Elukey: [V: 2 C: 2] Increase the squid's logrotate log retention to 2 [puppet] - https://gerrit.wikimedia.org/r/342586 (https://phabricator.wikimedia.org/T153940) (owner: Elukey)
[10:04:38] (PS1) Filippo Giunchedi: Revert "Revert "facilities: add codfw PDUs"" [puppet] - https://gerrit.wikimedia.org/r/342598
[10:04:45] (PS2) Filippo Giunchedi: Revert "Revert "facilities: add codfw PDUs"" [puppet] - https://gerrit.wikimedia.org/r/342598
[10:06:11] (CR) Filippo Giunchedi: [C: 2] Revert "Revert "facilities: add codfw PDUs"" [puppet] - https://gerrit.wikimedia.org/r/342598 (owner: Filippo Giunchedi)
[10:11:45] RECOVERY - puppet last run on labtestweb2001 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures
[10:21:15] RECOVERY - puppet last run on cp3046 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures
[10:21:45] RECOVERY - puppet last run on elastic1051 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures
[10:22:15] PROBLEM - puppet last run on cp3031 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[10:25:15] PROBLEM - MariaDB Slave SQL: s3 on db2057 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1146, Errmsg: Error Table mediawikiwiki.echo_event doesnt exist on query. Default database: mediawikiwiki.
[Query snipped]
[10:26:52] ^I will fix that
[10:30:15] RECOVERY - MariaDB Slave SQL: s3 on db2057 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[10:36:08] (PS4) Jcrespo: puppet: Remove templatedir setting [puppet] - https://gerrit.wikimedia.org/r/337837 (https://phabricator.wikimedia.org/T95158)
[10:43:15] Operations, Puppet, Labs, Traffic, Technical-Debt: Convert all of our site.pp/roles to the role/profile paradigm - https://phabricator.wikimedia.org/T159412#3066827 (Ciencia_Al_Poder) @Joe I found the easy tag is not suitable/applicable to this task, that's why I removed it Feel free to poke...
[10:47:05] Operations, Puppet, Labs, Traffic, Technical-Debt: Convert all of our site.pp/roles to the role/profile paradigm - https://phabricator.wikimedia.org/T159412#3066827 (MoritzMuehlenhoff) I'd say let's add a few gerrit links of style conversions which have already landed to the task description,...
[10:47:45] PROBLEM - puppet last run on mw1235 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[10:51:15] RECOVERY - puppet last run on cp3031 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures
[11:01:31] (PS1) Jcrespo: mariadb: Repool db1054 with full weight after maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/342603
[11:04:02] there is some problem on db1083
[11:07:18] it had an ALTER table ran yesterday
[11:07:34] and was depooled until the morning
[11:07:39] what are you seeing?
[11:15:45] RECOVERY - puppet last run on mw1235 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures
[11:17:37] <_joe_> jynus: I am merging your templatedir patch
[11:17:52] it is actually not mine, but tim's
[11:32:56] (PS1) Hashar: Migrate typos check to a rake task [puppet] - https://gerrit.wikimedia.org/r/342604 (https://phabricator.wikimedia.org/T119140)
[11:46:10] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1080" [mediawiki-config] - https://gerrit.wikimedia.org/r/342607
[11:46:13] (PS2) Marostegui: Revert "db-eqiad.php: Depool db1080" [mediawiki-config] - https://gerrit.wikimedia.org/r/342607
[11:47:11] Operations, Pybal, Traffic: pybal stops logging - https://phabricator.wikimedia.org/T160405#3097560 (ema)
[11:47:25] Operations, Pybal, Traffic: pybal stops logging - https://phabricator.wikimedia.org/T160405#3097572 (ema) p:Triage>Normal
[11:49:08] (PS1) Ema: pybal: use 'rsyslog rotate' in logrotate configuration [puppet] - https://gerrit.wikimedia.org/r/342608 (https://phabricator.wikimedia.org/T160405)
[11:57:12] Operations, Pywikibot-core, Traffic, HTTPS, and 2 others: Prepare pywikibot for http -> https switch in entity uri - https://phabricator.wikimedia.org/T159956#3097590 (ema)
[11:57:34] (CR) Muehlenhoff: "We have a group ldap-admins in data.yaml, which grants LDAP tool access to two non-ops users. To avoid confusion, I've remove ldap_admins " [puppet] - https://gerrit.wikimedia.org/r/342008 (https://phabricator.wikimedia.org/T157131) (owner: Muehlenhoff)
[11:59:28] (PS1) Urbanecm: Add d to enwikisource's import list [mediawiki-config] - https://gerrit.wikimedia.org/r/342609 (https://phabricator.wikimedia.org/T160403)
[12:00:04] addshore: Respected human, time to deploy InterwikiSorting (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170314T1200). Please do the needful.
[12:00:07] o/
[12:00:15] (PS2) Muehlenhoff: Harmomise group type for LDAP admin access [puppet] - https://gerrit.wikimedia.org/r/342008 (https://phabricator.wikimedia.org/T157131)
[12:00:46] (PS5) Addshore: wmgUseInterwikiSorting true for all wikidata clients [mediawiki-config] - https://gerrit.wikimedia.org/r/341034 (https://phabricator.wikimedia.org/T150183)
[12:00:50] (CR) Addshore: wmgUseInterwikiSorting true for all wikidata clients [mediawiki-config] - https://gerrit.wikimedia.org/r/341034 (https://phabricator.wikimedia.org/T150183) (owner: Addshore)
[12:07:17] (CR) Addshore: [C: 2] wmgUseInterwikiSorting true for all wikidata clients [mediawiki-config] - https://gerrit.wikimedia.org/r/341034 (https://phabricator.wikimedia.org/T150183) (owner: Addshore)
[12:08:39] (Merged) jenkins-bot: wmgUseInterwikiSorting true for all wikidata clients [mediawiki-config] - https://gerrit.wikimedia.org/r/341034 (https://phabricator.wikimedia.org/T150183) (owner: Addshore)
[12:08:48] (CR) jenkins-bot: wmgUseInterwikiSorting true for all wikidata clients [mediawiki-config] - https://gerrit.wikimedia.org/r/341034 (https://phabricator.wikimedia.org/T150183) (owner: Addshore)
[12:08:49] Operations, Mail: Yahoo is blocking mail from wikimedia - https://phabricator.wikimedia.org/T160381#3096795 (Liuxinyu970226) See also: T58414
[12:09:03] Operations, Mail: Get mail relay out of Yahoo!
blacklist: apply to Yahoo for whitelisting bulk mail - https://phabricator.wikimedia.org/T58414#616526 (Liuxinyu970226) See also: T160381
[12:12:44] (PS1) Addshore: Revert "Add interwikisorting to CommonSettings $wikiTags" [mediawiki-config] - https://gerrit.wikimedia.org/r/342611
[12:12:48] (PS2) Addshore: Revert "Add interwikisorting to CommonSettings $wikiTags" [mediawiki-config] - https://gerrit.wikimedia.org/r/342611
[12:12:53] (CR) Addshore: [C: 2] Revert "Add interwikisorting to CommonSettings $wikiTags" [mediawiki-config] - https://gerrit.wikimedia.org/r/342611 (owner: Addshore)
[12:13:15] PROBLEM - puppet last run on rdb1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[12:13:15] PROBLEM - HHVM rendering on mwdebug1002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50585 bytes in 0.022 second response time
[12:13:35] PROBLEM - Nginx local proxy to apache on mwdebug1002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50633 bytes in 0.036 second response time
[12:13:48] mwdebug1002 is me, fixing now
[12:13:55] PROBLEM - Apache HTTP on mwdebug1002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50585 bytes in 0.021 second response time
[12:14:08] (Merged) jenkins-bot: Revert "Add interwikisorting to CommonSettings $wikiTags" [mediawiki-config] - https://gerrit.wikimedia.org/r/342611 (owner: Addshore)
[12:14:20] (CR) jenkins-bot: Revert "Add interwikisorting to CommonSettings $wikiTags" [mediawiki-config] - https://gerrit.wikimedia.org/r/342611 (owner: Addshore)
[12:14:35] RECOVERY - Nginx local proxy to apache on mwdebug1002 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 619 bytes in 0.053 second response time
[12:14:45] and mwdebug1002 is back :)
[12:14:56] RECOVERY - Apache HTTP on mwdebug1002 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 618 bytes in 0.070 second response time
[12:15:11] (PS3) Gehel:
elasticsearch: align static and persistent config [puppet] - https://gerrit.wikimedia.org/r/342490
[12:15:15] RECOVERY - HHVM rendering on mwdebug1002 is OK: HTTP OK: HTTP/1.1 200 OK - 71733 bytes in 0.148 second response time
[12:16:30] (CR) Gehel: [C: 2] elasticsearch: align static and persistent config [puppet] - https://gerrit.wikimedia.org/r/342490 (owner: Gehel)
[12:18:21] !log addshore@tin Synchronized wmf-config/InitialiseSettings.php: T150183 wmgUseInterwikiSorting true for all wikidata clients [[gerrit:341034|#1]] [[gerrit:342611|#2]] PT 1/4 (duration: 00m 52s)
[12:18:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:18:28] T150183: Deploy InterwikiSorting extension to production - https://phabricator.wikimedia.org/T150183
[12:19:18] !log addshore@tin Synchronized wmf-config/CommonSettings.php: T150183 wmgUseInterwikiSorting true for all wikidata clients [[gerrit:341034|#1]] [[gerrit:342611|#2]] PT 2/4 (duration: 00m 41s)
[12:19:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:23:18] !log addshore@tin Synchronized docroot/: T150183 wmgUseInterwikiSorting true for all wikidata clients [[gerrit:341034|#1]] [[gerrit:342611|#2]] PT 3/4 NOOP (duration: 00m 44s)
[12:23:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:23:25] (PS1) Gehel: osm - append to log file, dont replace it [puppet] - https://gerrit.wikimedia.org/r/342612
[12:24:18] !log addshore@tin Synchronized dblists/: T150183 wmgUseInterwikiSorting true for all wikidata clients [[gerrit:341034|#1]] [[gerrit:342611|#2]] PT 4/4 (duration: 00m 41s)
[12:24:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:24:24] T150183: Deploy InterwikiSorting extension to production - https://phabricator.wikimedia.org/T150183
[12:24:33] (CR) Jcrespo: [C: 2] mariadb: Repool db1054 with full weight after maintenance [mediawiki-config] -
https://gerrit.wikimedia.org/r/342603 (owner: Jcrespo)
[12:25:31] (Merged) jenkins-bot: mariadb: Repool db1054 with full weight after maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/342603 (owner: Jcrespo)
[12:25:56] Operations, MediaWiki-extensions-InterwikiSorting, Wikidata, Wikimedia-Extension-setup, and 3 others: Deploy InterwikiSorting extension to production - https://phabricator.wikimedia.org/T150183#3097639 (Addshore) InterwikiSorting is now deployed to all wikis that already had sorting enabled (ie,...
[12:27:23] (CR) jenkins-bot: mariadb: Repool db1054 with full weight after maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/342603 (owner: Jcrespo)
[12:28:12] !log stopping mariadb on db1057, preparing to backup and reimage
[12:28:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:32:31] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Repool db1054 with full weight after warmup (duration: 00m 40s)
[12:32:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:39:47] (PS1) Jdrewniak: Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/342616 (https://phabricator.wikimedia.org/T128546)
[12:40:36] Operations, Analytics, Analytics-Cluster, Patch-For-Review, User-Elukey: Reimage all the Hadoop worker nodes to Debian Jessie - https://phabricator.wikimedia.org/T160333#3097658 (ops-monitoring-bot) Script wmf_auto_reimage was launched by elukey on neodymium.eqiad.wmnet for hosts: ``` ['analy...
[12:40:40] (03PS13) 10Zppix: Removal of "editusercssjs" in MW-CORE (Mediawiki 1.16 Depricated) Should be merged with Ic9851d53affe0f4ece7a79f541ec5cb39133b109 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342456 [12:41:15] RECOVERY - puppet last run on rdb1002 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [12:41:27] !log reimage analytics1043 to Debian Jessie [12:41:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:42:50] 06Operations, 06Labs: Remove linux kernel 3.16 from the jessie image on labs - https://phabricator.wikimedia.org/T159990#3097660 (10faidon) 05Open>03Invalid Sounds a lot like an [[ http://xyproblem.info/ | XY problem ]] in general, please avoid opening tasks like that :) In addition to what has been menti... [12:49:54] 06Operations, 06Analytics-Kanban, 10ChangeProp, 10Reading-Web-Trending-Service, 06Services (watching): Upgrade librdkafka 0.9.4 on SCB and Varnishes - https://phabricator.wikimedia.org/T159379#3097663 (10faidon) Just as a general comment, as long as the package name remains the same (`librdkafka1`), the... [12:54:51] 06Operations, 10netops: Audit and cleanup border-in ACL on core routers - https://phabricator.wikimedia.org/T160055#3087244 (10faidon) The first part is true and I couldn't figure out why -- I know why I removed the IPv6 multicast ranges (basically: NDP), but the question about 224/4 still remains. It may have... [13:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170314T1300). Please do the needful. [13:00:04] kart_ and jan_drewniak: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be available during the process. 
[13:00:20] !log restarting elasticsearch on relforge1001 to test gelf appender [13:00:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:38] * kart_ is here [13:00:55] Who is SWATing? [13:00:56] Hello. I can SWAT today. [13:01:19] Dereckson: nice. Mine is DB index update. Bit unusual one. [13:01:53] o/ (hoping for a no-frills portal deploy today :P ) [13:02:25] kart_: we'll do it afterwards in this case [13:02:47] kart_: you've a green light from jynus? [13:02:57] (if needed) [13:03:23] (03CR) 10Dereckson: [C: 032] Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342616 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [13:03:41] Dereckson: Change is already reviewed by jynus in T146450 [13:03:41] T146450: Unable to translate an article for which another user has a deleted draft - https://phabricator.wikimedia.org/T146450 [13:04:49] (03Merged) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342616 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [13:05:02] (03CR) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342616 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [13:05:13] kart_: that's a DBA operation, please follow the process at https://wikitech.wikimedia.org/wiki/Schema_changes#Workflow_of_a_schema_change [13:05:32] Nikerabbit: ^ [13:06:45] 06Operations, 10Analytics, 10Analytics-Cluster, 13Patch-For-Review, 15User-Elukey: Reimage all the Hadoop worker nodes to Debian Jessie - https://phabricator.wikimedia.org/T160333#3097694 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['analytics1043.eqiad.wmnet'] ``` and were **ALL** succe... 
[13:06:46] (03PS1) 10Gehel: elasticsearch - enable shipping logs to logstash for elasticsearch 5.x [puppet] - 10https://gerrit.wikimedia.org/r/342618 [13:07:13] kart_: Nikerabbit: mainly it's to document the change you wish, state it's already tested on labs, and j.ynus or m.arostegui will apply it [13:07:38] Dereckson: fine [13:07:43] kart_: let's do that then [13:07:44] Unattended DB deployment is currently limited to creating new tables [13:07:48] How it can be applied depends on the size of the table in production [13:07:54] If it's empty/tiny, we can just change indexes [13:08:09] If it's got a reasonable amount of data, chances are it'll be done with OSC [13:08:47] jan_drewniak: portals bump live on mwdebug1002 [13:09:03] Reasonable data, I would say. But the change (new indexes) has already been reviewed and applied in Labs/Beta. [13:09:30] Dereckson: Nikerabbit: Let's reschedule this. [13:09:30] yes but jynus left a comment on Gerrit: schema change procedure should be followed [13:09:48] I missed that :/ [13:10:01] Dereckson: looking good on mwdebug1002 [13:10:04] kart_: I'm filing a bug [13:10:15] Nikerabbit: thanks. [13:11:49] (03CR) 10Gehel: "Since the ES5 upgrade will generate a sizeable logspam, we might want to merge this only after the elasticsearch eqiad cluster has been up" [puppet] - 10https://gerrit.wikimedia.org/r/342618 (owner: 10Gehel) [13:12:40] kart_: here: https://gerrit.wikimedia.org/r/#/c/337409/ - I have no issue with the proposed schema change, just a comment on code style. Please follow procedure as usual for deployment when ready. https://wikitech.wikimedia.org/wiki/Schema_changes This should be very easy and fast to deploy, probably a 5-minute change. I have checked ahead of time that no collisions will happen when [13:12:46] the unique is extended.
[13:13:38] jan_drewniak: ack'ed [13:14:27] !log dereckson@tin Synchronized portals/prod/wikipedia.org/assets: (no justification provided) (duration: 00m 41s) [13:14:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:09] !log dereckson@tin Synchronized portals: (no justification provided) (duration: 00m 41s) [13:15:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:40] 255835 rows in that table [13:15:43] Dereckson: Thanks! Yes. We missed it badly. [13:15:44] That's some replag waiting to happen :) [13:15:57] 06Operations: Upgrading python-phabricator in trusty - https://phabricator.wikimedia.org/T160408#3097718 (10MoritzMuehlenhoff) [13:18:58] !log started redis-cli --bigkeys -i 0.1 on rdb1008 (eqiad jobqueue slave) [13:19:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:50] 06Operations: Upgrading python-phabricator in trusty - https://phabricator.wikimedia.org/T160408#3097742 (10MoritzMuehlenhoff) Backported packages are at https://people.wikimedia.org/~jmm/phab/ [13:24:47] (03CR) 10Alexandros Kosiaris: [C: 04-1] "OK, but now you need a logrotate rule as well :-)" [puppet] - 10https://gerrit.wikimedia.org/r/342612 (owner: 10Gehel) [13:25:41] (03CR) 10Gehel: "yep, the logrotate rule is already in place (not that it made much sense up to now)" [puppet] - 10https://gerrit.wikimedia.org/r/342612 (owner: 10Gehel) [13:25:49] 06Operations: Upgrading python-phabricator in trusty - https://phabricator.wikimedia.org/T160408#3097747 (10mmodell) I don't know of anything on iridium that's actually using that. @chasemp might remember? [13:26:47] 06Operations: Upgrading python-phabricator in trusty - https://phabricator.wikimedia.org/T160408#3097758 (10mmodell) Unless chase remembers something that I am forgetting, then I'd say go ahead with the new version. Anything that depends on the old version can be found and fixed if it breaks. 
[13:29:19] (03PS1) 10DCausse: Disable updateSuggesterIndex cron take 2 [puppet] - 10https://gerrit.wikimedia.org/r/342624 [13:29:23] 06Operations: Upgrading python-phabricator in trusty - https://phabricator.wikimedia.org/T160408#3097761 (10chasemp) Turns out labcontrol* has the package but I don't know of anything that needs it. I agree, and I'll confirm as much as possible but it's not a blocker. [13:30:06] (03CR) 10jerkins-bot: [V: 04-1] Disable updateSuggesterIndex cron take 2 [puppet] - 10https://gerrit.wikimedia.org/r/342624 (owner: 10DCausse) [13:31:40] (03CR) 10Alexandros Kosiaris: [C: 031] "ok, switching to +1 then" [puppet] - 10https://gerrit.wikimedia.org/r/342612 (owner: 10Gehel) [13:33:06] (03PS2) 10DCausse: Disable updateSuggesterIndex cron take 2 [puppet] - 10https://gerrit.wikimedia.org/r/342624 [13:33:23] 06Operations, 15User-Elukey: JobQueue Redis codfw replicas periodically lagging - https://phabricator.wikimedia.org/T159850#3097779 (10elukey) I ran redis-cli --big-keys -i 0.1 for rdb1008:6379 (slaveof rdb1007) and got this result: ``` -------- summary ------- Sampled 4000451 keys in the keyspace! Total key... [13:35:18] PROBLEM - puppet last run on db1047 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [13:37:48] (03CR) 10Gehel: [C: 032] osm - append to log file, dont replace it [puppet] - 10https://gerrit.wikimedia.org/r/342612 (owner: 10Gehel) [13:40:48] jan_drewniak: *new* portal version on mwdebug1002 [13:42:33] (03CR) 10Gehel: [C: 032] Disable updateSuggesterIndex cron take 2 [puppet] - 10https://gerrit.wikimedia.org/r/342624 (owner: 10DCausse) [13:42:38] (03PS3) 10Gehel: Disable updateSuggesterIndex cron take 2 [puppet] - 10https://gerrit.wikimedia.org/r/342624 (owner: 10DCausse) [13:42:58] (03PS1) 10Alexandros Kosiaris: Fix prometheus textfile exporter cronspam [puppet] - 10https://gerrit.wikimedia.org/r/342625 [13:43:21] Dereckson: ok, now i'm for sure looking at the right page [13:43:27] and it's fine [13:43:31] ok [13:45:18] !log dereckson@tin Synchronized portals/prod/wikipedia.org/assets: Bump to e576c18522ff (duration: 00m 41s) [13:45:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:37] (03PS2) 10Alexandros Kosiaris: Fix prometheus OSM sync lag exporter cronspam [puppet] - 10https://gerrit.wikimedia.org/r/342625 [13:46:00] !log dereckson@tin Synchronized portals: Bump to e576c18522ff (duration: 00m 41s) [13:46:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:16] jan_drewniak: Done, with the URL purge at the end [13:48:36] 06Operations, 10Mail: Yahoo is blocking mail from wikimedia - https://phabricator.wikimedia.org/T160381#3097832 (10Nemo_bis) 05Open>03declined We can't control the issue which this task claims is ongoing ("Yahoo does X"). We can only make sure we do our best to make our messages deliverable, as asked in T5... [13:50:27] Dereckson: I believe it, and yet I don't know why mwdebug1002 and production are still different.
Could be an issue with that sync-script maybe [13:51:11] (03PS1) 1020after4: Phabricator: add config for elasticsearch 5 in codfw [puppet] - 10https://gerrit.wikimedia.org/r/342626 (https://phabricator.wikimedia.org/T157479) [13:51:39] dereckson@tin:/srv/mediawiki-staging/portals$ md5sum data/site-stats.json [13:51:43] fb31996d1d22400f06934df21cd65095 data/site-stats.json [13:51:44] dereckson@mw1170:/srv/mediawiki/portals$ md5sum data/site-stats.json [13:51:48] 206d2a0ab97fee98cb1cbe49c1639b93 data/site-stats.json [13:52:33] those should be the same right? [13:52:37] right [13:52:55] (03CR) 10Filippo Giunchedi: "LGTM, though we're making the same mistake in hhvm and mediawiki modules, would be good to fix that and possibly others" [puppet] - 10https://gerrit.wikimedia.org/r/342608 (https://phabricator.wikimedia.org/T160405) (owner: 10Ema) [13:52:57] and we could have imagined `scap sync-dir portals` would update subfiles [13:54:32] Dereckson: I assume that's what this line should do `scap sync-dir portals $*` [13:54:37] https://phabricator.wikimedia.org/diffusion/WPOR/browse/master/sync-portals [13:54:47] * Dereckson nods [13:55:12] I was looking at https://github.com/wikimedia/portals/compare/90f81ea0...e576c185 there are a lot of files modified, so it won't be convenient to scap sync-file [13:55:40] Let's try to touch portals/ and resync [13:56:30] !log labsdb1001 maintain-views --all-databases --table page --replace-all --debug [13:56:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:55] !log dereckson@tin Synchronized portals: Resync portals/ directory after touch (duration: 00m 42s) [13:57:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:23] ah [13:57:24] this time: [13:57:27] dereckson@mw1170:/srv/mediawiki/portals$ md5sum data/site-stats.json [13:57:30] fb31996d1d22400f06934df21cd65095 data/site-stats.json [13:58:19] jan_drewniak: looks good in prod now? 
[13:59:37] (03CR) 10Alexandros Kosiaris: [C: 032] Fix prometheus OSM sync lag exporter cronspam [puppet] - 10https://gerrit.wikimedia.org/r/342625 (owner: 10Alexandros Kosiaris) [13:59:42] (03PS3) 10Alexandros Kosiaris: Fix prometheus OSM sync lag exporter cronspam [puppet] - 10https://gerrit.wikimedia.org/r/342625 [13:59:58] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Fix prometheus OSM sync lag exporter cronspam [puppet] - 10https://gerrit.wikimedia.org/r/342625 (owner: 10Alexandros Kosiaris) [13:59:58] PROBLEM - Check Varnish expiry mailbox lag on cp1074 is CRITICAL: CRITICAL: expiry mailbox lag is 736240 [14:00:21] Dereckson: ... not really :( I still see the old version [14:00:58] Did the list of things to purge get purged? [14:01:07] Let's resend the purge [14:01:17] (03CR) 1020after4: "This will make phabricator write to both clusters while reading only from eqiad." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/342626 (https://phabricator.wikimedia.org/T157479) (owner: 1020after4) [14:01:30] !log Purged portals URL [14:01:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:43] Dereckson: I think that did [14:02:54] (03CR) 1020after4: "http://puppet-compiler.wmflabs.org/5762/iridium.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/342626 (https://phabricator.wikimedia.org/T157479) (owner: 1020after4) [14:03:08] it. as in, I see the updated version in production now [14:03:18] RECOVERY - puppet last run on db1047 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [14:03:53] Good. [14:04:33] very good! is there something that should be changed so the dirs actually sync?
[14:05:09] I'm n [14:05:18] I'm not sure, the only thing I did is a touch [14:05:49] but normally, the directory modified date already changed after git submodule update [14:07:51] yeah that's weird [14:09:40] 06Operations: Upgrading python-phabricator in trusty - https://phabricator.wikimedia.org/T160408#3097854 (10mmodell) It might be comforting to know that the old version quite probably does not work with our modern version of phabricator. Authentication tokens changed quite a long time ago in a backwards incompa... [14:09:48] PROBLEM - puppet last run on mw1190 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:16:52] 06Operations: Upgrading python-phabricator in trusty - https://phabricator.wikimedia.org/T160408#3097890 (10MoritzMuehlenhoff) @mmodel: Ack, that's what triggered me to prepare the backport, Conduit completely failed to authenticate with 0.4. I'll upgrade this tomorrow on iridium and labcontrol*, then [14:17:47] (03PS1) 10Jcrespo: osc_host.sh: Add support for gtid-based online alter table [software] - 10https://gerrit.wikimedia.org/r/342629 (https://phabricator.wikimedia.org/T160407) [14:18:02] !log labsdb1003 time maintain-views --all-databases --table page --replace-all --debug [14:18:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:06] Dereckson: swat done? 
[14:20:11] !log labsdb100[9|10|11] 'maintain-views --all-databases --table page --replace-all --debug' [14:20:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:00] (03CR) 10Marostegui: [C: 031] osc_host.sh: Add support for gtid-based online alter table [software] - 10https://gerrit.wikimedia.org/r/342629 (https://phabricator.wikimedia.org/T160407) (owner: 10Jcrespo) [14:22:05] (03PS2) 10Elukey: Add the Jmxtrans configuration for the MapReduce History server [puppet/cdh] - 10https://gerrit.wikimedia.org/r/342592 (https://phabricator.wikimedia.org/T156272) [14:22:29] (03CR) 10jerkins-bot: [V: 04-1] Add the Jmxtrans configuration for the MapReduce History server [puppet/cdh] - 10https://gerrit.wikimedia.org/r/342592 (https://phabricator.wikimedia.org/T156272) (owner: 10Elukey) [14:23:18] thank you jenkins, always a pleasure [14:23:44] 06Operations, 06Labs: Add lock_wait_timeout to maintain_views and maintain-meta_p - https://phabricator.wikimedia.org/T160412#3097911 (10chasemp) [14:23:55] 06Operations, 06Labs: Add lock_wait_timeout to maintain_views and maintain-meta_p - https://phabricator.wikimedia.org/T160412#3097923 (10chasemp) p:05Triage>03Normal [14:24:53] 06Operations, 06Commons, 10Datasets-General-or-Unknown, 07Community-Wishlist-Survey-2016: Back up of Commons files - https://phabricator.wikimedia.org/T160229#3092719 (10akosiaris) Seeing bacula being mentioned (I indeed should try to update the wikitech page, although not many things have changed), I just... [14:25:38] (03PS3) 10Elukey: Add the Jmxtrans configuration for the MapReduce History server [puppet/cdh] - 10https://gerrit.wikimedia.org/r/342592 (https://phabricator.wikimedia.org/T156272) [14:26:58] (03PS3) 10Marostegui: Revert "db-eqiad.php: Depool db1080" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342607 [14:29:40] marostegui: yes [14:29:53] ok thanks! 
will deploy db-eqiad.php [14:33:35] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1080" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342607 (owner: 10Marostegui) [14:35:30] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1080" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342607 (owner: 10Marostegui) [14:35:39] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1080" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342607 (owner: 10Marostegui) [14:36:24] !log Enabled parallel replication (5 threads) on db2033 (x1) - T160407 [14:36:28] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1080 - T132416 (duration: 00m 40s) [14:36:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:31] T160407: Apply wikishared.cx_translations index change - https://phabricator.wikimedia.org/T160407 [14:36:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:35] T132416: Rampant differences in indexes on enwiki.revision across the DB cluster - https://phabricator.wikimedia.org/T132416 [14:38:48] RECOVERY - puppet last run on mw1190 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [14:39:42] 06Operations, 07HHVM, 13Patch-For-Review, 07Upstream: Build / migrate to HHVM 3.18 - https://phabricator.wikimedia.org/T158176#3097998 (10MoritzMuehlenhoff) FIxed in hhvm master with https://github.com/facebook/hhvm/commit/5207b59eb88e2f9820efb74442245a4f5aa9eb17 I'll rebuild our 3.18.1 package with that... [14:40:36] 06Operations, 07HHVM, 13Patch-For-Review, 07Upstream: Build / migrate to HHVM 3.18 - https://phabricator.wikimedia.org/T158176#3098004 (10Reedy) >>! In T158176#3097998, @MoritzMuehlenhoff wrote: > FIxed in hhvm master with https://github.com/facebook/hhvm/commit/5207b59eb88e2f9820efb74442245a4f5aa9eb17 >... 
[14:41:31] (03PS1) 10Muehlenhoff: Add support for Phabricator offboarding to offboarding script [puppet] - 10https://gerrit.wikimedia.org/r/342633 (https://phabricator.wikimedia.org/T142825) [14:48:46] (03PS1) 10Hashar: contint: PHP ext build dependencies on Nodepool [puppet] - 10https://gerrit.wikimedia.org/r/342635 (https://phabricator.wikimedia.org/T134381) [14:49:18] PROBLEM - Check Varnish expiry mailbox lag on cp1072 is CRITICAL: CRITICAL: expiry mailbox lag is 646136 [14:49:22] (03CR) 10jerkins-bot: [V: 04-1] Add support for Phabricator offboarding to offboarding script [puppet] - 10https://gerrit.wikimedia.org/r/342633 (https://phabricator.wikimedia.org/T142825) (owner: 10Muehlenhoff) [14:49:34] (03CR) 1020after4: [C: 031] "Note: I already have the indexer running on iridium indexing to codfw to get the es5 index up to date. Once indexing is complete and this " [puppet] - 10https://gerrit.wikimedia.org/r/342626 (https://phabricator.wikimedia.org/T157479) (owner: 1020after4) [14:50:51] 06Operations, 10ops-codfw: Degraded RAID on ms-be2008 - https://phabricator.wikimedia.org/T160312#3098017 (10Papaul) Disk replacement complete. 
[14:51:40] 06Operations, 10ops-codfw: Degraded RAID on ms-be2008 - https://phabricator.wikimedia.org/T160312#3098019 (10Papaul) a:05Papaul>03fgiunchedi [14:53:41] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/5765/" [puppet/cdh] - 10https://gerrit.wikimedia.org/r/342592 (https://phabricator.wikimedia.org/T156272) (owner: 10Elukey) [14:53:53] (03CR) 10Alexandros Kosiaris: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/334149 (owner: 10Alexandros Kosiaris) [14:56:14] (03CR) 10jerkins-bot: [V: 04-1] nagios: Specify a parents host relationship [puppet] - 10https://gerrit.wikimedia.org/r/334149 (owner: 10Alexandros Kosiaris) [14:58:19] (03PS2) 10Hashar: contint: PHP ext build dependencies on Nodepool [puppet] - 10https://gerrit.wikimedia.org/r/342635 (https://phabricator.wikimedia.org/T134381) [14:59:29] RECOVERY - puppet last run on ms-be2008 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [15:01:33] 06Operations, 10ops-codfw: Degraded RAID on ms-be2008 - https://phabricator.wikimedia.org/T160312#3098045 (10fgiunchedi) 05Open>03Resolved Disk rebuilding [15:05:27] (03PS2) 10Muehlenhoff: Add support for Phabricator offboarding to offboarding script [puppet] - 10https://gerrit.wikimedia.org/r/342633 (https://phabricator.wikimedia.org/T142825) [15:09:27] 06Operations, 06Analytics-Kanban, 10ChangeProp, 10Reading-Web-Trending-Service, 06Services (watching): Upgrade librdkafka 0.9.4 on SCB and Varnishes - https://phabricator.wikimedia.org/T159379#3098089 (10Ottomata) Ya, I expect things to be fine on the varnishes. I was also hoping that our currently deplo... 
[15:11:15] (03PS1) 10Alexandros Kosiaris: Update puppet-lint gem to 2.1.1 [puppet] - 10https://gerrit.wikimedia.org/r/342637 [15:12:04] (03CR) 10jerkins-bot: [V: 04-1] Update puppet-lint gem to 2.1.1 [puppet] - 10https://gerrit.wikimedia.org/r/342637 (owner: 10Alexandros Kosiaris) [15:13:05] 06Operations, 10Mail: Yahoo is blocking mail from wikimedia - https://phabricator.wikimedia.org/T160381#3098094 (10Paladox) >>! In T160381#3097832, @Nemo_bis wrote: > We can't control the issue which this task claims is ongoing ("Yahoo does X"). We can only make sure we do our best to make our messages deliver... [15:13:27] (03CR) 10BryanDavis: [C: 032] Remove support for Precise [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/342061 (https://phabricator.wikimedia.org/T94792) (owner: 10BryanDavis) [15:13:45] (03CR) 10Ottomata: [C: 031] "1 nit, but +1 :)" (032 comments) [puppet/cdh] - 10https://gerrit.wikimedia.org/r/342592 (https://phabricator.wikimedia.org/T156272) (owner: 10Elukey) [15:14:28] (03CR) 10BryanDavis: [C: 032] Full PEP8/Flake8 compliance [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/342069 (owner: 10BryanDavis) [15:14:37] (03Merged) 10jenkins-bot: Remove support for Precise [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/342061 (https://phabricator.wikimedia.org/T94792) (owner: 10BryanDavis) [15:15:18] (03Merged) 10jenkins-bot: Full PEP8/Flake8 compliance [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/342069 (owner: 10BryanDavis) [15:15:31] (03PS2) 10Alexandros Kosiaris: Update puppet-lint gem to 2.0.2 [puppet] - 10https://gerrit.wikimedia.org/r/342637 [15:15:39] (03PS4) 10Elukey: Add the Jmxtrans configuration for the MapReduce History server [puppet/cdh] - 10https://gerrit.wikimedia.org/r/342592 (https://phabricator.wikimedia.org/T156272) [15:17:38] PROBLEM - Check Varnish expiry mailbox lag on cp1099 is CRITICAL: CRITICAL: expiry mailbox lag is 629051 [15:18:39] !log preparing to branch 
1.29.0-wmf.16 refs T158997 [15:18:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:18:44] T158997: MW-1.29.0-wmf.16 deployment blockers - https://phabricator.wikimedia.org/T158997 [15:19:59] (03CR) 10Elukey: "All good again https://puppet-compiler.wmflabs.org/5767/" [puppet/cdh] - 10https://gerrit.wikimedia.org/r/342592 (https://phabricator.wikimedia.org/T156272) (owner: 10Elukey) [15:21:32] 06Operations, 06Discovery, 10Wikimedia-Portals, 03Discovery-Portal-Sprint, and 2 others: https://www.wikipedia.org/ portal doesn't have any text - https://phabricator.wikimedia.org/T158782#3098145 (10Jdrewniak) With the changes to [[ https://gerrit.wikimedia.org/r/339657 | the server config ]] and the [[ h... [15:24:03] !log silence toolschecker precise job start check in anticipation of removal [15:24:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:24:32] (03PS5) 10Elukey: Add the Jmxtrans configuration for the MapReduce History server [puppet/cdh] - 10https://gerrit.wikimedia.org/r/342592 (https://phabricator.wikimedia.org/T156272) [15:25:38] PROBLEM - puppet last run on mw1262 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:25:51] (03CR) 10Jcrespo: [C: 032] osc_host.sh: Add support for gtid-based online alter table [software] - 10https://gerrit.wikimedia.org/r/342629 (https://phabricator.wikimedia.org/T160407) (owner: 10Jcrespo) [15:28:00] (03Merged) 10jenkins-bot: osc_host.sh: Add support for gtid-based online alter table [software] - 10https://gerrit.wikimedia.org/r/342629 (https://phabricator.wikimedia.org/T160407) (owner: 10Jcrespo) [15:33:08] 06Operations, 06Commons, 10Datasets-General-or-Unknown, 07Community-Wishlist-Survey-2016: Back up of Commons files - https://phabricator.wikimedia.org/T160229#3098176 (10Nemo_bis) > Are these issues identified in the "server" side of things Yes, mostly issues with Swift. 
[15:36:38] (03CR) 10Elukey: [C: 032] Add the Jmxtrans configuration for the MapReduce History server [puppet/cdh] - 10https://gerrit.wikimedia.org/r/342592 (https://phabricator.wikimedia.org/T156272) (owner: 10Elukey) [15:38:36] 06Operations, 10ops-codfw: ms-be2002.codfw.wmnet has drac issues - https://phabricator.wikimedia.org/T155689#3098217 (10fgiunchedi) FWIW this host is slated for decom in some weeks, I wouldn't spend too much time on its idrac especially if there's other hosts not to be decom with broken idrac [15:39:18] !log shut ms-be2002 for idrac / bios troubleshooting T155689 [15:39:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:39:24] T155689: ms-be2002.codfw.wmnet has drac issues - https://phabricator.wikimedia.org/T155689 [15:39:26] (03PS1) 10Elukey: Update the cdh module's sha with the latest change [puppet] - 10https://gerrit.wikimedia.org/r/342640 (https://phabricator.wikimedia.org/T156272) [15:40:30] 06Operations, 10Pybal, 10Traffic, 13Patch-For-Review: pybal stops logging - https://phabricator.wikimedia.org/T160405#3098227 (10ema) On machines where pybal doesn't log anything, the pybal process is indeed not doing any meaningful writes to standard out/err: ``` $ sudo strace -e trace=write -e write=1,2... [15:41:16] (03CR) 10Elukey: [C: 032] Update the cdh module's sha with the latest change [puppet] - 10https://gerrit.wikimedia.org/r/342640 (https://phabricator.wikimedia.org/T156272) (owner: 10Elukey) [15:50:19] PROBLEM - puppet last run on mira is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [15:53:38] RECOVERY - puppet last run on mw1262 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [15:54:01] (03CR) 10Alexandros Kosiaris: [C: 04-1] service: Send uwsgi logs to logstash (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/321096 (https://phabricator.wikimedia.org/T149010) (owner: 10Ladsgroup) [16:00:04] godog, moritzm, and _joe_: Respected human, time to deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170314T1600). Please do the needful. [16:00:43] no patches [16:01:16] https://media.giphy.com/media/3o7abldj0b3rxrZUxW/giphy.gif [16:06:55] 06Operations, 06Analytics-Kanban, 10ChangeProp, 10Reading-Web-Trending-Service, 06Services (watching): Upgrade librdkafka 0.9.4 on SCB and Varnishes - https://phabricator.wikimedia.org/T159379#3098349 (10Pchelolo) @faidon The `librdkafka` [[ https://github.com/edenhill/librdkafka/releases | changelog ]]... [16:10:04] 06Operations, 06Analytics-Kanban, 10ChangeProp, 10Reading-Web-Trending-Service, 06Services (watching): Upgrade librdkafka 0.9.4 on SCB and Varnishes - https://phabricator.wikimedia.org/T159379#3098364 (10faidon) Oh, I hadn't realized node-rdkafka was using the C++ API. Yes, the C++ ABI is unstable, cf. h... [16:10:18] PROBLEM - puppet last run on aqs1006 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [16:10:40] !log T111113: Restart Cassandra in RESTBase Staging to enable optional client encryption [16:10:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:10:45] T111113: Cassandra client encryption - https://phabricator.wikimedia.org/T111113 [16:19:18] RECOVERY - puppet last run on mira is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [16:27:38] RECOVERY - Check Varnish expiry mailbox lag on cp1099 is OK: OK: expiry mailbox lag is 2 [16:29:07] !log no response from db1057 after powercycle, trying to hard reset it [16:29:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:31:13] (03PS1) 10Urbanecm: Add new throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342644 (https://phabricator.wikimedia.org/T160427) [16:32:08] PROBLEM - Check Varnish expiry mailbox lag on cp1049 is CRITICAL: CRITICAL: expiry mailbox lag is 592922 [16:34:59] (03PS17) 10Filippo Giunchedi: prometheus: add snmp_exporter module and profile [puppet] - 10https://gerrit.wikimedia.org/r/341005 (https://phabricator.wikimedia.org/T148541) [16:35:01] (03PS7) 10Filippo Giunchedi: [WIP] add PDUs jobs to prometheus [puppet] - 10https://gerrit.wikimedia.org/r/341535 (https://phabricator.wikimedia.org/T148541) [16:35:03] (03PS1) 10Filippo Giunchedi: Add network::monitor role [puppet] - 10https://gerrit.wikimedia.org/r/342648 (https://phabricator.wikimedia.org/T148541) [16:35:55] (03PS10) 10Eevans: Cassanra TLS configuration for RESTBase [puppet] - 10https://gerrit.wikimedia.org/r/342088 (https://phabricator.wikimedia.org/T111113) [16:35:57] (03CR) 10jerkins-bot: [V: 04-1] [WIP] add PDUs jobs to prometheus [puppet] - 10https://gerrit.wikimedia.org/r/341535 (https://phabricator.wikimedia.org/T148541) (owner: 10Filippo Giunchedi) [16:36:46] (03CR) 10Eevans: [C: 031] "This is now ready to be merged."
[puppet] - 10https://gerrit.wikimedia.org/r/342088 (https://phabricator.wikimedia.org/T111113) (owner: 10Eevans) [16:36:49] (03PS18) 10Filippo Giunchedi: prometheus: add snmp_exporter module and profile [puppet] - 10https://gerrit.wikimedia.org/r/341005 (https://phabricator.wikimedia.org/T148541) [16:36:51] (03PS2) 10Filippo Giunchedi: Add network::monitor role [puppet] - 10https://gerrit.wikimedia.org/r/342648 (https://phabricator.wikimedia.org/T148541) [16:36:53] (03PS8) 10Filippo Giunchedi: [WIP] add PDUs jobs to prometheus [puppet] - 10https://gerrit.wikimedia.org/r/341535 (https://phabricator.wikimedia.org/T148541) [16:38:44] (03CR) 10jerkins-bot: [V: 04-1] [WIP] add PDUs jobs to prometheus [puppet] - 10https://gerrit.wikimedia.org/r/341535 (https://phabricator.wikimedia.org/T148541) (owner: 10Filippo Giunchedi) [16:39:18] RECOVERY - puppet last run on aqs1006 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [16:49:58] RECOVERY - Check Varnish expiry mailbox lag on cp1074 is OK: OK: expiry mailbox lag is 0 [16:51:14] (03PS6) 10Mbch331: Remove exception on Other Projects sidebar for Dutch Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/341195 (https://phabricator.wikimedia.org/T159634) [16:56:38] 06Operations, 10Pybal, 10Traffic: Upgrade twisted on load balancers to 16.2.0 - https://phabricator.wikimedia.org/T160433#3098548 (10ema) [16:56:57] 06Operations, 10Pybal, 10Traffic: Upgrade twisted on load balancers to 16.2.0 - https://phabricator.wikimedia.org/T160433#3098571 (10ema) p:05Triage>03Normal [16:57:20] 06Operations, 10ops-eqiad: db1057 does not react to powercycle/powerdown/powerup commands - https://phabricator.wikimedia.org/T160435#3098576 (10jcrespo) [16:58:03] 06Operations, 06Analytics-Kanban, 10ChangeProp, 10Reading-Web-Trending-Service, 06Services (watching): Upgrade librdkafka 0.9.4 on SCB and Varnishes - https://phabricator.wikimedia.org/T159379#3098592 (10Ottomata) Included 0.9.4 in [[ 
https://apt.wikimedia.org/wikimedia/pool/backports/libr/librdkafka/ |... [16:58:38] PROBLEM - puppet last run on cp3008 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:00:04] gwicke, cscott, arlolra, subbu, halfak, and Amir1: Dear anthropoid, the time has come. Please deploy Services – Graphoid / Parsoid / OCG / Citoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170314T1700). [17:00:12] no parsoid deploy today [17:02:08] RECOVERY - Check Varnish expiry mailbox lag on cp1049 is OK: OK: expiry mailbox lag is 137 [17:07:17] !log upgrading librdkafka to 0.9.4 on cache misc and restarting varnishkafka [17:07:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:10:38] PROBLEM - Disk space on graphite1001 is CRITICAL: DISK CRITICAL - free space: / 1717 MB (3% inode=97%) [17:12:09] mhh I'll take a look [17:15:33] sigh statsite logs [17:15:38] PROBLEM - Disk space on graphite1001 is CRITICAL: DISK CRITICAL - free space: / 1681 MB (3% inode=97%) [17:15:59] (03PS2) 10Jforrester: Show 'Publish' not 'Save' on most public wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337530 (https://phabricator.wikimedia.org/T131132) [17:16:16] (03CR) 10Jforrester: [C: 031] "Scheduled for tomorrow's 11:00 SWAT." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/337530 (https://phabricator.wikimedia.org/T131132) (owner: 10Jforrester) [17:20:38] RECOVERY - Disk space on graphite1001 is OK: DISK OK [17:23:08] PROBLEM - puppet last run on mx1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:24:30] 06Operations, 06Discovery, 10Elasticsearch, 10Wikimedia-Logstash, 06Discovery-Search (Current work): Update logstash on wikimedia to 5.x - https://phabricator.wikimedia.org/T154473#3098701 (10Deskana) @EBernhardson thinks this may need splitting up into several tasks. There's a lot of major version upgra... 
[17:26:38] RECOVERY - puppet last run on cp3008 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [17:29:18] (03PS1) 10Chad: Add gigantic warning to files/ directory [puppet] - 10https://gerrit.wikimedia.org/r/342653 [17:29:44] 06Operations, 06Discovery, 06Discovery-Search (Current work): remove swap from elasticsearch servers - https://phabricator.wikimedia.org/T158884#3098712 (10Deskana) p:05Triage>03Normal [17:30:29] (03PS1) 10Filippo Giunchedi: statsite: ignore spammy messages [puppet] - 10https://gerrit.wikimedia.org/r/342654 (https://phabricator.wikimedia.org/T73322) [17:31:28] PROBLEM - puppet last run on cp3032 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:31:39] (03PS1) 10Madhuvishy: tools: Turn precise reminder cron off [puppet] - 10https://gerrit.wikimedia.org/r/342655 [17:34:19] (03PS2) 10Madhuvishy: tools: Turn precise reminder cron off [puppet] - 10https://gerrit.wikimedia.org/r/342655 [17:35:48] PROBLEM - puppet last run on mw1187 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:35:53] 06Operations, 10Fundraising-Backlog, 10MediaWiki-extensions-CentralNotice, 10Traffic, and 2 others: Purge Varnish cache when a banner is saved - https://phabricator.wikimedia.org/T154954#3098754 (10AndyRussG) >>! In T154954#3095397, @ema wrote: > Uhm, 3K purges 50-200 times a day seem too many for banner u... 
[17:37:12] (03CR) 10Madhuvishy: [C: 032] tools: Turn precise reminder cron off [puppet] - 10https://gerrit.wikimedia.org/r/342655 (owner: 10Madhuvishy) [17:42:59] (03PS3) 10Giuseppe Lavagetto: Add redis switching task, some more stages boilerplate [switchdc] - 10https://gerrit.wikimedia.org/r/342498 [17:43:01] (03PS1) 10Giuseppe Lavagetto: Decouple logging setup from importing the module [switchdc] - 10https://gerrit.wikimedia.org/r/342657 [17:47:27] 06Operations, 10DNS, 10Parsoid, 10Traffic, 13Patch-For-Review: Separate subdomain for parsoid visual diff test service on ruthenium - https://phabricator.wikimedia.org/T159995#3098777 (10Dzahn) @ema does that seem ok to you? review of patches would be great if you have the time [17:49:17] (03PS5) 10Ottomata: webperf: Use trebuchet install of eventlogging for ve.py [puppet] - 10https://gerrit.wikimedia.org/r/341723 (https://phabricator.wikimedia.org/T131977) (owner: 10Krinkle) [17:49:27] (03CR) 10Ottomata: [V: 032 C: 032] webperf: Use trebuchet install of eventlogging for ve.py [puppet] - 10https://gerrit.wikimedia.org/r/341723 (https://phabricator.wikimedia.org/T131977) (owner: 10Krinkle) [17:49:36] (03PS5) 10Ottomata: webperf: Update navtiming.py to use eventlogging instead of zmq [puppet] - 10https://gerrit.wikimedia.org/r/341724 (owner: 10Krinkle) [17:49:48] (03PS2) 10Andrew Bogott: Move keystone icinga checks to nrpe on labnet1001 [puppet] - 10https://gerrit.wikimedia.org/r/334658 (https://phabricator.wikimedia.org/T157760) [17:49:54] (03CR) 10Ottomata: [V: 032 C: 032] webperf: Update navtiming.py to use eventlogging instead of zmq [puppet] - 10https://gerrit.wikimedia.org/r/341724 (owner: 10Krinkle) [17:50:01] (03CR) 10Andrew Bogott: Move keystone icinga checks to nrpe on labnet1001 [puppet] - 10https://gerrit.wikimedia.org/r/334658 (https://phabricator.wikimedia.org/T157760) (owner: 10Andrew Bogott) [17:51:08] RECOVERY - puppet last run on mx1001 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 
0 failures [17:52:22] (03PS1) 10Madhuvishy: tools: Deprecate precise_reminder role and clean up related script [puppet] - 10https://gerrit.wikimedia.org/r/342658 (https://phabricator.wikimedia.org/T149214) [17:52:28] (03PS3) 10Andrew Bogott: Move keystone icinga checks to nrpe on labnet1001 [puppet] - 10https://gerrit.wikimedia.org/r/334658 (https://phabricator.wikimedia.org/T157760) [17:55:58] 06Operations, 10MediaWiki-Configuration, 06Performance-Team, 06Services (watching), and 5 others: Allow integration of data from etcd into the MediaWiki configuration - https://phabricator.wikimedia.org/T156924#3098843 (10Joe) >>! In T156924#3096695, @aaron wrote: > Assuming there are decent and simple lib... [17:56:22] (03PS6) 10Ottomata: navtiming: Make tests easier to extend [puppet] - 10https://gerrit.wikimedia.org/r/338044 (owner: 10Krinkle) [17:57:36] (03CR) 10Krinkle: [C: 031] navtiming: Make tests easier to extend [puppet] - 10https://gerrit.wikimedia.org/r/338044 (owner: 10Krinkle) [17:58:46] (03PS1) 10Chad: Clean up l10nupdate cache junk when `scap clean` is run [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342659 (https://phabricator.wikimedia.org/T119747) [17:59:19] RECOVERY - Check Varnish expiry mailbox lag on cp1072 is OK: OK: expiry mailbox lag is 267 [17:59:23] (03CR) 10Ottomata: [V: 032 C: 032] navtiming: Make tests easier to extend [puppet] - 10https://gerrit.wikimedia.org/r/338044 (owner: 10Krinkle) [17:59:28] RECOVERY - puppet last run on cp3032 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [17:59:34] (03CR) 10Dzahn: "i just find the naming ("mirror"), a bit unfortunate since the topology is supposed to be a cluster with equal members that can failover, " [puppet] - 10https://gerrit.wikimedia.org/r/342626 (https://phabricator.wikimedia.org/T157479) (owner: 1020after4) [17:59:43] (03PS23) 10Ottomata: webperf: Update event logging consumer for userAgent schema change [puppet] - 
10https://gerrit.wikimedia.org/r/337158 (https://phabricator.wikimedia.org/T156760) (owner: 10Nuria) [17:59:49] (03CR) 10Ottomata: [V: 032 C: 032] webperf: Update event logging consumer for userAgent schema change [puppet] - 10https://gerrit.wikimedia.org/r/337158 (https://phabricator.wikimedia.org/T156760) (owner: 10Nuria) [18:00:13] (03PS2) 10Dzahn: Phabricator: add config for elasticsearch 5 in codfw [puppet] - 10https://gerrit.wikimedia.org/r/342626 (https://phabricator.wikimedia.org/T157479) (owner: 1020after4) [18:01:49] (03CR) 10Dzahn: "i saw more details and your comments, yea, i'm not gonna slow this down over naming concerns" [puppet] - 10https://gerrit.wikimedia.org/r/342626 (https://phabricator.wikimedia.org/T157479) (owner: 1020after4) [18:02:27] (03CR) 10Dzahn: [C: 031] Phabricator: add config for elasticsearch 5 in codfw [puppet] - 10https://gerrit.wikimedia.org/r/342626 (https://phabricator.wikimedia.org/T157479) (owner: 1020after4) [18:03:48] RECOVERY - puppet last run on mw1187 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [18:06:51] (03CR) 1020after4: [C: 032] Clean up l10nupdate cache junk when `scap clean` is run [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342659 (https://phabricator.wikimedia.org/T119747) (owner: 10Chad) [18:07:55] (03CR) 1020after4: [C: 031] "yeah naming isn't great. neither is the messy way this uses local puppet variables and hiera lookups. 
I'll come up with something cleaner " [puppet] - 10https://gerrit.wikimedia.org/r/342626 (https://phabricator.wikimedia.org/T157479) (owner: 1020after4) [18:09:40] I'm about to merge some monitoring changes, so it's safe to ignore spurious keystone alerts for the next hour or two [18:09:57] (03PS4) 10Andrew Bogott: Move keystone icinga checks to nrpe on labnet1001 [puppet] - 10https://gerrit.wikimedia.org/r/334658 (https://phabricator.wikimedia.org/T157760) [18:11:42] !log nuria@tin Started deploy [eventlogging/analytics@c3ccb4a]: (no justification provided) [18:11:46] !log nuria@tin Finished deploy [eventlogging/analytics@c3ccb4a]: (no justification provided) (duration: 00m 03s) [18:11:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:11:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:13:34] !log upgrading librdkafka to 0.9.4 and restarting varnishkafka on cache misc hosts [18:13:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:14:22] 06Operations, 06Labs, 13Patch-For-Review, 07Tracking: overhaul labstore setup [tracking] - https://phabricator.wikimedia.org/T126083#3098923 (10madhuvishy) [18:15:15] (03PS1) 10BryanDavis: labs: Remove references to tools-webgrid-lighttpd-12* [puppet] - 10https://gerrit.wikimedia.org/r/342660 (https://phabricator.wikimedia.org/T160442) [18:16:02] (03CR) 10Dzahn: [C: 031] "ok cool. re: Hiera lookups, we are supposed to use all of these to the newer "profile" structure anyways, at some point. 
Nowadays https://" [puppet] - 10https://gerrit.wikimedia.org/r/342626 (https://phabricator.wikimedia.org/T157479) (owner: 1020after4) [18:16:58] (03CR) 10Andrew Bogott: [C: 032] labs: Remove references to tools-webgrid-lighttpd-12* [puppet] - 10https://gerrit.wikimedia.org/r/342660 (https://phabricator.wikimedia.org/T160442) (owner: 10BryanDavis) [18:18:29] (03CR) 10Dzahn: [C: 031] "s/use/move/g" [puppet] - 10https://gerrit.wikimedia.org/r/342626 (https://phabricator.wikimedia.org/T157479) (owner: 1020after4) [18:19:02] !log upgrading librdkafka to 0.9.4 and restarting varnishkafka on cache upload hosts [18:19:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:23:25] (03Merged) 10jenkins-bot: Clean up l10nupdate cache junk when `scap clean` is run [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342659 (https://phabricator.wikimedia.org/T119747) (owner: 10Chad) [18:23:38] (03CR) 10jenkins-bot: Clean up l10nupdate cache junk when `scap clean` is run [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342659 (https://phabricator.wikimedia.org/T119747) (owner: 10Chad) [18:24:28] RECOVERY - Unmerged changes on repository puppet on labcontrol1002 is OK: No changes to merge. [18:25:30] (03CR) 10Dzahn: [V: 032 C: 032] "24min since rebase" [puppet] - 10https://gerrit.wikimedia.org/r/342626 (https://phabricator.wikimedia.org/T157479) (owner: 1020after4) [18:25:38] RECOVERY - Unmerged changes on repository puppet on labtestcontrol2001 is OK: No changes to merge. [18:27:05] (03CR) 10Dzahn: "@20after4: applied on iridium and phab2001" [puppet] - 10https://gerrit.wikimedia.org/r/342626 (https://phabricator.wikimedia.org/T157479) (owner: 1020after4) [18:28:13] (03CR) 10Dzahn: "@20after4 wanna confirm this?" 
[puppet] - 10https://gerrit.wikimedia.org/r/342275 (owner: 10Paladox) [18:28:41] (03Abandoned) 10Paladox: Phabricator: Remove three unneeded configs [puppet] - 10https://gerrit.wikimedia.org/r/342275 (owner: 10Paladox) [18:29:43] (03CR) 10Dzahn: "oh? not current anymore?" [puppet] - 10https://gerrit.wikimedia.org/r/342275 (owner: 10Paladox) [18:30:17] (03CR) 10Paladox: "Nope this https://gerrit.wikimedia.org/r/342626 change did it :)" [puppet] - 10https://gerrit.wikimedia.org/r/342275 (owner: 10Paladox) [18:30:28] (03CR) 10Dzahn: "i don't understand the last part about the user being removed and nda" [debs/gerrit] - 10https://gerrit.wikimedia.org/r/342082 (owner: 10Paladox) [18:30:45] (03CR) 10Dzahn: "ok, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/342275 (owner: 10Paladox) [18:31:21] (03CR) 10Dzahn: "so what's up with this one then?" [puppet] - 10https://gerrit.wikimedia.org/r/342276 (owner: 10Paladox) [18:31:37] (03CR) 10Paladox: "Needs rebasing." [puppet] - 10https://gerrit.wikimedia.org/r/342276 (owner: 10Paladox) [18:31:51] (03PS4) 10Paladox: Phabricator: Use hiera for deciding when to enable read and write for mysql search [puppet] - 10https://gerrit.wikimedia.org/r/342276 [18:31:57] (03PS5) 10Paladox: Phabricator: Use hiera for deciding when to enable read and write for mysql search [puppet] - 10https://gerrit.wikimedia.org/r/342276 [18:32:20] (03CR) 10Dzahn: [C: 04-1] "+1 to hashar: "Lets do the migration to systemd proper instead of hacking"" [puppet] - 10https://gerrit.wikimedia.org/r/340496 (https://phabricator.wikimedia.org/T157785) (owner: 10Paladox) [18:32:41] (03CR) 10Dzahn: "abstain :p" [puppet] - 10https://gerrit.wikimedia.org/r/340900 (https://phabricator.wikimedia.org/T156120) (owner: 10Paladox) [18:34:34] (03PS4) 10Dzahn: Rename ferm service in role::labs::db::replica [puppet] - 10https://gerrit.wikimedia.org/r/328683 (owner: 10Muehlenhoff) [18:35:29] (03PS5) 10Andrew Bogott: Move keystone icinga checks to nrpe on labnet1001 [puppet] - 
10https://gerrit.wikimedia.org/r/334658 (https://phabricator.wikimedia.org/T157760) [18:35:41] (03PS1) 10Gehel: elasticsearch - no need to use swap [puppet] - 10https://gerrit.wikimedia.org/r/342662 (https://phabricator.wikimedia.org/T158884) [18:35:45] (03PS6) 10Paladox: Phabricator: Use hiera for deciding when to enable read and write for mysql search [puppet] - 10https://gerrit.wikimedia.org/r/342276 [18:35:53] (03CR) 10Dzahn: [C: 032] "merging, it's just renaming the resource and all DBAs +1ed :)" [puppet] - 10https://gerrit.wikimedia.org/r/328683 (owner: 10Muehlenhoff) [18:37:45] !log upgrading librdkafka to 0.9.4 and restarting varnishkafka on cache text hosts [18:37:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:39:28] !log removing swap from elasticsearch servers - T158884 [18:39:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:39:34] T158884: remove swap from elasticsearch servers - https://phabricator.wikimedia.org/T158884 [18:42:13] !log nuria@tin Started deploy [eventlogging/analytics@417c40f]: (no justification provided) [18:42:15] !log nuria@tin Finished deploy [eventlogging/analytics@417c40f]: (no justification provided) (duration: 00m 02s) [18:42:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:42:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:43:51] (03PS10) 10Paladox: Gerrit: Use the mariadb plugin instead of mysql [puppet] - 10https://gerrit.wikimedia.org/r/336003 (https://phabricator.wikimedia.org/T145885) [18:44:24] (03CR) 10Andrew Bogott: [C: 032] Move keystone icinga checks to nrpe on labnet1001 [puppet] - 10https://gerrit.wikimedia.org/r/334658 (https://phabricator.wikimedia.org/T157760) (owner: 10Andrew Bogott) [18:45:08] 06Operations, 06Analytics-Kanban, 10ChangeProp, 10Reading-Web-Trending-Service, 06Services (watching): Upgrade librdkafka 0.9.4 on SCB and Varnishes - 
https://phabricator.wikimedia.org/T159379#3099128 (10Ottomata) librdkafka has been upgraded to 0.9.4 on all cache hosts, and varnishkafka has been restart... [18:45:44] 06Operations, 06Discovery, 06Discovery-Search (Current work), 13Patch-For-Review: remove swap from elasticsearch servers - https://phabricator.wikimedia.org/T158884#3050526 (10Gehel) Swap is now disabled on all elasticsearch servers. [[ https://gerrit.wikimedia.org/r/342662 | A patch ]] still needs to be m... [18:46:53] jynus: yt? [18:47:08] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL: CRITICAL: 1.67% of data above the critical threshold [1000.0] [18:51:42] marostegui: yt? [18:52:01] (03Abandoned) 10Paladox: Zuul: Make sure git-daemon starts after installing it [puppet] - 10https://gerrit.wikimedia.org/r/340496 (https://phabricator.wikimedia.org/T157785) (owner: 10Paladox) [18:52:48] (03CR) 10Paladox: "> If the new URL in polygerrit would be the definitive ones, perhaps" [puppet] - 10https://gerrit.wikimedia.org/r/340900 (https://phabricator.wikimedia.org/T156120) (owner: 10Paladox) [18:53:07] (03PS5) 10Dzahn: Rename ferm service in role::labs::db::replica [puppet] - 10https://gerrit.wikimedia.org/r/328683 (owner: 10Muehlenhoff) [18:57:13] (03PS1) 10Giuseppe Lavagetto: Add cache wipe + warmup phase implementation [switchdc] - 10https://gerrit.wikimedia.org/r/342666 [18:58:53] (03PS2) 10Andrew Bogott: labs: Remove references to tools-webgrid-lighttpd-12* [puppet] - 10https://gerrit.wikimedia.org/r/342660 (https://phabricator.wikimedia.org/T160442) (owner: 10BryanDavis) [19:00:04] twentyafterfour: Dear anthropoid, the time has come. Please deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170314T1900). 
[19:00:09] 06Operations, 10Mail: Yahoo is blocking mail from wikimedia - https://phabricator.wikimedia.org/T160381#3099227 (10Paladox) Yahoo has now unblocked wikimedia as I'm receiving mail on time now :) [19:00:28] (03PS1) 10Phuedx: pagePreviews: Enable perf instrumentation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342672 (https://phabricator.wikimedia.org/T157111) [19:00:56] (03CR) 10Dzahn: [C: 032] planet: convert to profile/role-structure [puppet] - 10https://gerrit.wikimedia.org/r/342163 (owner: 10Dzahn) [19:01:03] (03PS6) 10Dzahn: planet: convert to profile/role-structure [puppet] - 10https://gerrit.wikimedia.org/r/342163 [19:03:00] (03CR) 10Bmansurov: [C: 031] pagePreviews: Enable perf instrumentation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342672 (https://phabricator.wikimedia.org/T157111) (owner: 10Phuedx) [19:05:05] 06Operations, 10Mail: Get mail relay out of Yahoo! blacklist: apply to Yahoo for whitelisting bulk mail - https://phabricator.wikimedia.org/T58414#3099244 (10Paladox) It seems i managed to get yahoo to unblock wikimedia though unsure if i did. Anyways my mail is back to normal from wikimedia :) [19:09:08] could an opsen with access to the private git repo help answer a question about what is stored there? [19:09:18] urandom: yep [19:09:39] what do you need [19:09:41] mutante: cool; cassandra-ca-manager creates some key material that is deployed from there [19:10:01] one of the things it creates, is a self-signed root ca in pem format [19:10:06] but i don't know if that is checked in [19:10:19] or if it's just the java keystores [19:10:31] there are lots of .csr and .kst files [19:10:41] for cassandra/restbase [19:10:41] is there a rootCa.crt? [19:10:50] (03PS8) 10Paladox: Gerrit: Add some apache rewrite rules for polygerrit [puppet] - 10https://gerrit.wikimedia.org/r/340900 (https://phabricator.wikimedia.org/T156120) [19:10:57] ./modules/secret/secrets/cassandra/services/rootCa.srl [19:11:09] so that sounds like No. 
[19:11:11] ./modules/secret/secrets/cassandra/services/rootCa.key [19:11:22] * urandom perks up [19:11:34] ./modules/secret/secrets/cassandra/services/rootCa.crt [19:11:37] (03CR) 10Paladox: "I will probably add a js config file see my fix at https://gerrit-review.googlesource.com/#/c/99004/ i have three parts to the fix but the" [puppet] - 10https://gerrit.wikimedia.org/r/340900 (https://phabricator.wikimedia.org/T156120) (owner: 10Paladox) [19:11:37] there it is [19:11:37] aha [19:11:50] mutante: snap, thanks! [19:11:51] wait, there is more [19:11:58] services-dev and services-test [19:12:07] with their own rootCa files in it [19:12:14] yeah, that makes sense [19:12:17] ok [19:12:23] alright, yw [19:12:36] mutante: hey, while i have you [19:12:37] ... [19:12:43] PROBLEM - Check for valid instance states on labnodepool1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:12:44] could i get a merge of https://gerrit.wikimedia.org/r/#/c/342088 ?} [19:12:45] i would rather not merge that change now [19:12:50] oh [19:12:51] ok [19:13:24] it's a no-op everywhere but staging, if that matters [19:13:45] !log Delete 2FA for User:Conny per request on IRC. 
Identy verified via Lydia_WMDE [19:13:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:13:51] http://puppet-compiler.wmflabs.org/5726/ [19:14:06] !log restarting pybal on lvs1010-11 T160405 [19:14:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:14:11] T160405: pybal stops logging - https://phabricator.wikimedia.org/T160405 [19:17:02] 06Operations, 10Analytics, 10Analytics-Cluster, 10hardware-requests: EQIAD: stat1003 replacement - https://phabricator.wikimedia.org/T159839#3099320 (10Ottomata) Bump @robh [19:17:05] 06Operations, 10Analytics, 10Analytics-Cluster, 10hardware-requests: EQIAD: stat1002 replacement - https://phabricator.wikimedia.org/T159838#3099321 (10Ottomata) Bump @robh [19:18:49] !log twentyafterfour@tin Started scap: full scap of new branch, move test wikis to 1.29.0-wmf.16 refs T158997 [19:18:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:18:55] T158997: MW-1.29.0-wmf.16 deployment blockers - https://phabricator.wikimedia.org/T158997 [19:20:47] (03PS11) 10Eevans: Cassanra TLS configuration for RESTBase [puppet] - 10https://gerrit.wikimedia.org/r/342088 (https://phabricator.wikimedia.org/T111113) [19:21:11] (03CR) 10Dzahn: [V: 032 C: 032] planet: convert to profile/role-structure [puppet] - 10https://gerrit.wikimedia.org/r/342163 (owner: 10Dzahn) [19:22:33] RECOVERY - Check for valid instance states on labnodepool1001 is OK: nodepool state management is OK [19:22:42] (03PS3) 10Andrew Bogott: labs: Remove references to tools-webgrid-lighttpd-12* [puppet] - 10https://gerrit.wikimedia.org/r/342660 (https://phabricator.wikimedia.org/T160442) (owner: 10BryanDavis) [19:25:31] (03CR) 10Dzahn: [C: 031] microsites: convert to profile/role-structure (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/342164 (owner: 10Dzahn) [19:28:37] (03PS5) 10Dzahn: microsites: convert to profile/role-structure [puppet] - 10https://gerrit.wikimedia.org/r/342164 
[19:29:17] PROBLEM - novaadmin has roles in every project on labnet1001 is CRITICAL: In testlabs, user novaadmin should have roles [user, projectadmin] but has [uprojectadmin] [19:30:47] PROBLEM - Keystone admin and observer projects exist on labtestnet2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:31:37] PROBLEM - novaadmin has roles in every project on labtestnet2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:31:47] PROBLEM - novaobserver has only observer role on labtestnet2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:32:31] ^^ is that normal? [19:32:38] chasemp andrewbogott ^^ [19:32:39] (03CR) 10Dzahn: [C: 031] "happy to remove the comments but then it should be changed in the working example too" [puppet] - 10https://gerrit.wikimedia.org/r/342164 (owner: 10Dzahn) [19:32:59] paladox: it's certainly not important [19:33:04] but, it's what I'm working on today. [19:33:07] Ok [19:33:17] PROBLEM - novaadmin has roles in every project on labnet1001 is CRITICAL: In testlabs, user novaadmin should have roles [user, projectadmin] but has [uuser] [19:34:17] RECOVERY - novaadmin has roles in every project on labnet1001 is OK: novaadmin has the correct roles in all projects. [19:35:17] PROBLEM - puppet last run on labservices1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[/usr/local/bin/labs-ip-alias-dump.py] [19:38:50] 06Operations, 10Analytics, 10DBA: Prep to decommission old dbstore hosts (db1046, db1047) - https://phabricator.wikimedia.org/T156844#3099433 (10jcrespo) To be clear: there are 3 machines with eventlogging/analytics stuff (among other)- db1046, db1047 and dbstore1002 (this last one should still be ok for a c... 
[19:40:18] RECOVERY - puppet last run on labservices1002 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [19:45:08] (03PS1) 10Eevans: [WIP] Enable cqlsh client encryption [puppet] - 10https://gerrit.wikimedia.org/r/342679 (https://phabricator.wikimedia.org/T111113) [19:46:26] (03CR) 10Eevans: [C: 04-1] "Not ready to be merged." [puppet] - 10https://gerrit.wikimedia.org/r/342679 (https://phabricator.wikimedia.org/T111113) (owner: 10Eevans) [19:46:43] (03PS1) 10BryanDavis: labs: Remove references to tools-exec-12* [puppet] - 10https://gerrit.wikimedia.org/r/342680 (https://phabricator.wikimedia.org/T160457) [19:47:43] (03PS2) 10Eevans: [WIP] Enable cqlsh client encryption [puppet] - 10https://gerrit.wikimedia.org/r/342679 (https://phabricator.wikimedia.org/T111113) [19:49:17] RECOVERY - carbon-cache too many creates on graphite1001 is OK: OK: Less than 1.00% above the threshold [500.0] [19:49:37] (03CR) 10Andrew Bogott: [C: 032] labs: Remove references to tools-exec-12* [puppet] - 10https://gerrit.wikimedia.org/r/342680 (https://phabricator.wikimedia.org/T160457) (owner: 10BryanDavis) [19:51:05] (03PS1) 10Catrope: Follow-up b08f84348: enable goodfaith in beta labs too [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342681 [19:52:27] (03PS1) 10Andrew Bogott: check_keystone_projects: Fix typo in error reporting [puppet] - 10https://gerrit.wikimedia.org/r/342682 [19:57:58] (03PS1) 10Gergő Tisza: Deploy PageViewInfo to group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342683 (https://phabricator.wikimedia.org/T125917) [19:58:00] (03PS1) 10Gergő Tisza: Deploy PageViewInfo to group1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342684 (https://phabricator.wikimedia.org/T125917) [19:58:02] (03PS1) 10Gergő Tisza: Deploy PageViewInfo to group2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342685 (https://phabricator.wikimedia.org/T125917) [20:02:55] (03CR) 10Mattflaschen: [C: 032] Follow-up b08f84348: enable 
goodfaith in beta labs too [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342681 (owner: 10Catrope) [20:03:52] (03CR) 10Andrew Bogott: [C: 032] check_keystone_projects: Fix typo in error reporting [puppet] - 10https://gerrit.wikimedia.org/r/342682 (owner: 10Andrew Bogott) [20:04:08] Someone got a second to kill tin:/srv/deployment/abacist and mira:/srv/deployment/abacist? It seems to be a dead project (nothing in puppet, anywhere) [20:05:18] (03Merged) 10jenkins-bot: Follow-up b08f84348: enable goodfaith in beta labs too [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342681 (owner: 10Catrope) [20:07:19] (03CR) 10jenkins-bot: Follow-up b08f84348: enable goodfaith in beta labs too [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342681 (owner: 10Catrope) [20:08:44] (03CR) 10Jdlrobson: [C: 031] Make Page Previews use RESTBase on "stage 0" wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/340697 (https://phabricator.wikimedia.org/T158221) (owner: 10Phuedx) [20:10:27] PROBLEM - novaadmin has roles in every project on labtestnet2001 is CRITICAL: In reportcard, user novaadmin should have roles [user, projectadmin] but has [uprojectadmin] [20:10:45] 06Operations, 10Analytics, 10DBA: Prep to decommission old dbstore hosts (db1046, db1047) - https://phabricator.wikimedia.org/T156844#3099522 (10Ottomata) Oh yeah, rats, I totally forgot to put this in our budget request. Hm. do db1046 and db1047 host just EL data, or also wiki dbs? [20:10:55] (03PS1) 10Andrew Bogott: check_keystone_roles.py: Exclude domains from project checks. [puppet] - 10https://gerrit.wikimedia.org/r/342686 [20:11:20] (03CR) 10Andrew Bogott: [C: 032] check_keystone_roles.py: Exclude domains from project checks. 
[puppet] - 10https://gerrit.wikimedia.org/r/342686 (owner: 10Andrew Bogott) [20:14:55] !log twentyafterfour@tin Finished scap: full scap of new branch, move test wikis to 1.29.0-wmf.16 refs T158997 (duration: 56m 05s) [20:15:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:15:01] T158997: MW-1.29.0-wmf.16 deployment blockers - https://phabricator.wikimedia.org/T158997 [20:17:53] !log scap was unable to connect to mw2256.codfw.wmnet [20:17:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:18:42] (03PS1) 10BryanDavis: labs: Remove references to tools-exec-gift [puppet] - 10https://gerrit.wikimedia.org/r/342688 (https://phabricator.wikimedia.org/T160461) [20:19:27] PROBLEM - novaadmin has roles in every project on labtestnet2001 is CRITICAL: In reportcard, user novaadmin should have roles [user, projectadmin] but has [uprojectadmin] [20:19:40] (03PS2) 10Andrew Bogott: labs: Remove references to tools-exec-gift [puppet] - 10https://gerrit.wikimedia.org/r/342688 (https://phabricator.wikimedia.org/T160461) (owner: 10BryanDavis) [20:21:46] (03CR) 10Chad: "The user gerrit2 should be removed from LDAP (cf: T160122) for a myriad of reasons. Recently we noticed it somehow ended up in the NDA gro" [debs/gerrit] - 10https://gerrit.wikimedia.org/r/342082 (owner: 10Paladox) [20:22:05] (03CR) 10Andrew Bogott: [C: 032] labs: Remove references to tools-exec-gift [puppet] - 10https://gerrit.wikimedia.org/r/342688 (https://phabricator.wikimedia.org/T160461) (owner: 10BryanDavis) [20:26:27] RECOVERY - novaadmin has roles in every project on labtestnet2001 is OK: novaadmin has the correct roles in all projects. [20:26:37] RECOVERY - novaobserver has only observer role on labtestnet2001 is OK: novaobserver has the correct roles in all projects. [20:26:37] RECOVERY - Keystone admin and observer projects exist on labtestnet2001 is OK: Keystone projects exist and have matching names and ids. 
[20:45:10] !log mattflaschen@tin Synchronized wmf-config/InitialiseSettings-labs.php: Beta Cluster only (duration: 02m 50s) [20:45:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:48:54] (03PS1) 10Dzahn: gerrit: convert to profile/role structure [puppet] - 10https://gerrit.wikimedia.org/r/342692 [20:52:03] (03PS2) 10Dzahn: gerrit: convert to profile/role structure [puppet] - 10https://gerrit.wikimedia.org/r/342692 [20:53:08] 06Operations, 10MediaWiki-extensions-InterwikiSorting, 10Wikidata, 10Wikimedia-Extension-setup, and 3 others: Deploy InterwikiSorting extension to production - https://phabricator.wikimedia.org/T150183#2776612 (10Nirmos) This seems to have caused T160465. [20:53:21] (03CR) 10Gergő Tisza: "See also https://gerrit.wikimedia.org/r/#/c/342694/" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342683 (https://phabricator.wikimedia.org/T125917) (owner: 10Gergő Tisza) [20:54:59] (03PS1) 10Gehel: wdqs - set heap size for blazegraph from puppet [puppet] - 10https://gerrit.wikimedia.org/r/342695 (https://phabricator.wikimedia.org/T160218) [20:58:11] (03PS1) 10Jdlrobson: Restrict page images to lead section on cawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342696 (https://phabricator.wikimedia.org/T152115) [21:00:46] (03PS2) 10Gehel: wdqs - set heap size for blazegraph from puppet [puppet] - 10https://gerrit.wikimedia.org/r/342695 (https://phabricator.wikimedia.org/T160218) [21:02:54] (03CR) 10Smalyshev: [C: 031] wdqs - set heap size for blazegraph from puppet [puppet] - 10https://gerrit.wikimedia.org/r/342695 (https://phabricator.wikimedia.org/T160218) (owner: 10Gehel) [21:06:16] (03CR) 10Dzahn: "Error: secret(): invalid secret cassandra/services/restbase1001/restbase1001.kst http://puppet-compiler.wmflabs.org/5772/restbase1001.eq" [puppet] - 10https://gerrit.wikimedia.org/r/342088 (https://phabricator.wikimedia.org/T111113) (owner: 10Eevans) [21:07:22] (03CR) 10Mobrovac: "@Dzahn, that means that the 
secret key is missing from the puppet compiler." [puppet] - 10https://gerrit.wikimedia.org/r/342088 (https://phabricator.wikimedia.org/T111113) (owner: 10Eevans) [21:08:17] PROBLEM - puppet last run on db1087 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:23:08] (03CR) 10jerkins-bot: [V: 04-1] gerrit: convert to profile/role structure [puppet] - 10https://gerrit.wikimedia.org/r/342692 (owner: 10Dzahn) [21:26:07] (03CR) 10Dzahn: "yes, should be added to labs/private so this can be compiled" [puppet] - 10https://gerrit.wikimedia.org/r/342088 (https://phabricator.wikimedia.org/T111113) (owner: 10Eevans) [21:26:29] (03PS1) 10BryanDavis: labs: Remove references to tools-precise-dev [puppet] - 10https://gerrit.wikimedia.org/r/342713 (https://phabricator.wikimedia.org/T160466) [21:28:06] (03CR) 10Andrew Bogott: [C: 032] labs: Remove references to tools-precise-dev [puppet] - 10https://gerrit.wikimedia.org/r/342713 (https://phabricator.wikimedia.org/T160466) (owner: 10BryanDavis) [21:29:14] !log reindexed search in group0 for mondays codfw search downtime/upgrade [21:29:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:31:01] (03CR) 10jerkins-bot: [V: 04-1] wdqs - set heap size for blazegraph from puppet [puppet] - 10https://gerrit.wikimedia.org/r/342695 (https://phabricator.wikimedia.org/T160218) (owner: 10Gehel) [21:34:41] Will https://gerrit.wikimedia.org/r/#/c/342456/ need to be swatted? 
[21:34:56] (03PS2) 10Gergő Tisza: Deploy PageViewInfo to testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342683 (https://phabricator.wikimedia.org/T125917) [21:34:58] (03PS2) 10Gergő Tisza: Deploy PageViewInfo to group1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342684 (https://phabricator.wikimedia.org/T125917) [21:35:00] (03PS2) 10Gergő Tisza: Deploy PageViewInfo to group2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342685 (https://phabricator.wikimedia.org/T125917) [21:35:02] (03PS1) 10Gergő Tisza: Deploy PageViewInfo to group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342728 (https://phabricator.wikimedia.org/T125917) [21:36:17] (03CR) 10Reedy: [C: 04-1] Removal of "editusercssjs" in MW-CORE (Mediawiki 1.16 Depricated) Should be merged with Ic9851d53affe0f4ece7a79f541ec5cb39133b109 (033 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342456 (owner: 10Zppix) [21:36:17] RECOVERY - puppet last run on db1087 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [21:36:23] Zppix: If you want it deploying, yes [21:37:19] That weird char change wasnt done by me i think gerrit glitched or something cause the patch shouldnt touch those to begin with [21:37:25] Reedy: [21:37:35] Yeah, I know [21:37:38] But they need correcting :) [21:37:45] gerrit doesn't glitch [21:37:48] it's perfect [21:37:56] It is probably browser corruption [21:38:10] Zppix: like from the top of https://noc.wikimedia.org/conf/highlight.php?file=InitialiseSettings.php [21:38:14] # Here are some umlauted vowels: öäü [21:38:14] # Here is a deck of cards: ♠♣♥♦ [21:38:14] # If you do not see these characters, do not edit this file. 
[21:38:42] I doubt that's representative enough of unicode we use anymore :) [21:39:02] Reedy: i dont speak those langs so im unable to fix it properly without possibly making it worse [21:39:03] Could easy have a browser/editor that barfs on characters beyond those [21:39:13] heh [21:39:24] Reedy: also it didnt change them till paladox made a commit to the patch [21:39:27] I know [21:39:31] Which blames the browser [21:39:40] Zppix: The easiest way to fix it, may be to just make a new commit with the changes you want, and give it hte same commit summary and change id [21:39:51] Locally, not using that stupid web editor :) [21:39:56] Reedy: i could reinsert it from noc? [21:40:01] Or that [21:40:06] Ok [21:40:13] Yeah, if you copy them in from the source, in a decent text editor [21:40:33] Reedy: all my commits were done on the web editor and it was fine [21:40:40] (03PS14) 10Zppix: Removal of "editusercssjs" in MW-CORE (Mediawiki 1.16 Depricated) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342456 [21:40:46] Is paladox still using Edge? [21:40:49] It's probably broken [21:41:35] Idk [21:41:43] Im using safari on ios 10.2.1 [21:42:33] Zppix: The corruption is still there anyway [21:42:52] Ps14 was just the commit msg [21:43:27] Reedy could you try inserting the char im curious to see what it does [21:50:04] living on the Edge [21:50:17] (TM) [21:50:29] Let me fix it [21:51:17] PROBLEM - puppet last run on rdb1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:54:10] (03PS15) 10Reedy: Removal of "editusercssjs" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342456 (owner: 10Zppix) [21:56:03] 06Operations, 10MediaWiki-extensions-InterwikiSorting, 10Wikidata, 10Wikimedia-Extension-setup, and 3 others: Deploy InterwikiSorting extension to production - https://phabricator.wikimedia.org/T150183#3100090 (10Aklapper) >>! In T150183#3097639, @Addshore wrote: > InterwikiSorting is now deployed to all w... 
[21:56:31] (03CR) 10Reedy: [C: 031] "Good to be swatted now" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342456 (owner: 10Zppix) [21:57:07] PROBLEM - puppet last run on db1018 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:00:04] tgr: Respected human, time to deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170314T2200). Please do the needful. [22:03:31] (03PS3) 10Dzahn: gerrit: convert to profile/role structure [puppet] - 10https://gerrit.wikimedia.org/r/342692 [22:05:44] Reedy: do i do it for swat euro or mediawiki train? [22:05:55] One of the SWAT [22:06:01] It's not something really associated with the train [22:07:24] Train would be alright if you were fixing a core bug that the train was likely to be making more prevelent on the cluster [22:10:55] train? [22:11:36] choooo choooo [22:12:25] do we have something called "mediawiki train"? [22:17:49] the weekly deploys [22:19:17] RECOVERY - puppet last run on rdb1006 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [22:19:33] (03CR) 10Gergő Tisza: [V: 032 C: 032] "Force-merging, Jenkins is too slow today to fit into a one-hour deploy window. PS1 went through CI, the difference is trivial." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342683 (https://phabricator.wikimedia.org/T125917) (owner: 10Gergő Tisza) [22:20:54] (03CR) 10jenkins-bot: Deploy PageViewInfo to testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342683 (https://phabricator.wikimedia.org/T125917) (owner: 10Gergő Tisza) [22:22:07] twentyafterfour: the 15->16 change in wikiversions.js is uncommitted, should I just make a commit for that? 
[22:22:36] tgr sorry I'll do it [22:22:45] (would sure have been nice if I checked that before merging my config patch) [22:23:08] (03PS1) 1020after4: group0 wikis to 1.29.0-wmf.16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342761 [22:23:09] twentyafterfour: you'll probably have to rebase it on https://gerrit.wikimedia.org/r/342683 , sorry [22:23:10] (03CR) 1020after4: [C: 032] group0 wikis to 1.29.0-wmf.16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342761 (owner: 1020after4) [22:23:30] tgr it's ok to deploy your config change [22:23:49] I reverted the wikiversions change and resubmitted it via gerrit [22:23:53] you can deploy first [22:24:52] (03Merged) 10jenkins-bot: group0 wikis to 1.29.0-wmf.16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342761 (owner: 1020after4) [22:25:02] (03CR) 10jenkins-bot: group0 wikis to 1.29.0-wmf.16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342761 (owner: 1020after4) [22:25:07] twentyafterfour: the json file is not used except by the train scripts, right? [22:25:07] RECOVERY - puppet last run on db1018 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [22:25:26] tgr: yeah [22:25:30] cool, thanks [22:25:45] it's ok to sync that too or I will do it right after you deploy your config change [22:26:04] I need to do a full scap, new extension [22:26:09] cool [22:28:05] Reedy i use edge occasionally now :) [22:29:13] A full scap will push wikiversions out... [22:29:14] RainbowSprinkles i like the web editor :), it's codemirror. [22:29:28] Reedy: and that's fine [22:29:38] I thought you'd reverted your change from tin? [22:29:42] Oh, it merged [22:29:44] nvm :) [22:31:00] can I test adding a new extension on mwdebug? [22:31:20] it wouldn't work because l10n needs scap, right? [22:31:37] right, I think you need full scap [22:31:56] I mean, you could sync just the l10n stuff manually and then test [22:32:34] how do I generate the cdbs and whatnot manually? 
[22:32:34] like `scap l10nupdate` and `scap sync-l10n` [22:32:53] not exactly manually but it bypasses syncing the code [22:32:58] do I run that on tin? [22:32:59] not sure if that's actually a good idea [22:33:10] tgr yeah tin [22:33:28] sync-l10n will touch the live servers anyways, I suppose? [22:33:29] those are the underlying commands that a full scap sync runs, along with rsyncing the code as well [22:33:33] yes [22:33:38] or maybe I can run l10nupdate on mwdebug? [22:33:50] yeah that could work [22:34:01] its not installed there is it? [22:34:03] I've never tried something like that before [22:34:12] bd808: I don't know [22:34:24] ^ bd808 would know better than I [22:34:36] the wrappers scripts aren't for sure [22:34:45] the extension would be I guess [22:34:57] Should run a full sync to build l10n everywhere [22:35:09] Then do your enable-my-extension patch and test that on mwdebug [22:35:20] yeah. that ^ [22:37:43] ugh, I would have to edit history since that's not the latest commit now [22:38:09] scap is installed on the mwdebugs, for scap pull I suppose [22:38:46] the worst I can do by running l10nupdate there is breaking the debug instance, right? [22:38:52] tgr: you've tested this in beta correct? what are you especially worried about? [22:38:53] so that seems safer [22:39:21] a mistake in the config patch, I suppose [22:39:35] doing things that are never done is seldom safer ;) [22:39:44] fair enough [22:39:53] Update: CI is backlogged no way to fix jenkins will prioritze +2s [22:39:59] I'll just sync then [22:42:07] PROBLEM - puppet last run on mw1221 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [22:42:24] !log tgr@tin Started scap: T125917: Deploy PageViewInfo to testwiki [22:42:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:42:31] T125917: Deploy the PageViewInfo extension to production - https://phabricator.wikimedia.org/T125917 [23:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170314T2300). [23:00:04] Krinkle, tgr, Jdlrobson, and MaxSem: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [23:00:09] o/ [23:00:11] \o [23:00:13] I can do it [23:00:25] MaxSem: I'm running over with scap [23:00:35] there's a bit of a meltdown with CI right now, so things are slow [23:00:40] tgr, sure [23:01:04] Is wmf.16 out anywhere? [23:01:19] https://tools.wmflabs.org/versions/ [23:01:24] says no [23:01:35] group0 [23:01:47] weird [23:01:48] MaxSem: in that case you can ignore the "enable perf instrumentation" patch [23:01:52] tgr: huh? [23:02:01] tgr im not seeing it on mediawiki.org [23:02:16] twentyafterfour: ^ maybe that tool got confused? [23:02:49] nope https://www.mediawiki.org/wiki/Special:Version [23:02:54] says 15 [23:02:54] https://www.mediawiki.org/wiki/Special:Version says 15 [23:02:58] jinx [23:02:59] :) [23:03:15] twentyafterfour: ^ [23:03:24] oh, oops, already pinged [23:03:28] hmm, edit pages on beta are 503 [23:03:55] it's on the testwikis though [23:04:12] o_O [23:04:13] Yup [23:04:16] tgr: it was only deployed to testwiki? 
[23:04:20] greg-g: My version on https://noc.wikimedia.org/conf/ is right ;P [23:05:16] https://github.com/wikimedia/operations-mediawiki-config/commit/b36fc38fcf0fb68379782bb77b44ceb2897210b8 [23:05:17] PROBLEM - puppet last run on oxygen is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [23:05:17] Hm.. indeed https://noc.wikimedia.org/conf/wikiversions.json says testwikis/mw is .16 [23:05:23] wonder why /versions/ thinks otherwise [23:05:28] Something is out of sink [23:05:34] *sync [23:05:43] (03PS1) 10Mattflaschen: Fix issues with ORES models: [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342767 [23:05:53] maybe my merged patch didn't get pulled? [23:05:55] Reedy: maybe versions tool need a restart? [23:06:17] https://gerrit.wikimedia.org/r/#/c/342761/ [23:06:35] I'm running sync so that should fix it [23:07:08] needs `scap sync-wikiversions` [23:07:30] twentyafterfour: the local change on tin only contained the three testwikis, so I guess that's what actually got scapped [23:07:48] twentyafterfour: should I run that when the scap sync is finished? [23:08:13] if mediawiki.org is still on 15 then yeah [23:11:07] RECOVERY - puppet last run on mw1221 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [23:13:48] (03CR) 10Jforrester: [C: 031] Fix issues with ORES models: [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342767 (owner: 10Mattflaschen) [23:17:06] sooooo how does all this impact swat? :) [23:17:27] hurry swat [23:17:31] it doesn't, it's delayed for unrelated reasons [23:18:09] bd808: should I be worried when scap says "connect to host timed out"? [23:18:19] or does that get retried? [23:18:28] tgr: depends on the host [23:18:34] does it start with a 2? [23:18:34] one of the apaches [23:18:39] mw2256.codfw.wmnet [23:18:43] tgr: one4 host was failing when I scapped earlier as well [23:18:45] same host [23:18:56] I guess if it is codfw then not a big deal? 
[23:19:02] it doesn't get auto-retried but it's not alive anyway [23:19:08] so retry would fail again [23:19:08] yeah [23:19:28] (03CR) 10Dzahn: "@eevans i see there is modules/secret/secrets/cassandra/services/restbase1001-a , restbase1001-b, restbase1001-c, but not just restbase100" [puppet] - 10https://gerrit.wikimedia.org/r/342088 (https://phabricator.wikimedia.org/T111113) (owner: 10Eevans) [23:19:48] probably just didn't get removed from the dsh list or something [23:19:54] yeah [23:20:30] (03CR) 10Dzahn: "actually, in production there is just restbase1007 and up" [puppet] - 10https://gerrit.wikimedia.org/r/342088 (https://phabricator.wikimedia.org/T111113) (owner: 10Eevans) [23:21:34] (03CR) 10Dzahn: "could you add it in labs/private as it should be in prod? then the compiler run will work and a root can create the identical thing in pro" [puppet] - 10https://gerrit.wikimedia.org/r/342088 (https://phabricator.wikimedia.org/T111113) (owner: 10Eevans) [23:22:20] 06:52 elukey: powercycle mw2256, stuck in boot (looked in the console) [23:22:31] That was yesterday [23:23:09] https://phabricator.wikimedia.org/T155180 [23:23:19] looks like memory errors [23:23:57] 06Operations, 10ops-codfw, 13Patch-For-Review, 15User-Elukey: codfw: mw2251-mw2260 rack/setup - https://phabricator.wikimedia.org/T155180#2936454 (10Reedy) @elukey Can we get mw2256 depooled from the dsh lists etc? Will stop scap giving timeout errors for the host [23:31:24] !log tgr@tin Finished scap: T125917: Deploy PageViewInfo to testwiki (duration: 48m 58s) [23:31:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:31:29] T125917: Deploy the PageViewInfo extension to production - https://phabricator.wikimedia.org/T125917 [23:31:33] MaxSem, I have a -labs.php (Beta Cluster) wmf-config to deploy. Could you let me know when the prod wmf-configs are done so we don't step on each other's toes. Or I can wait until SWAT is entirely done if you prefer.
[23:32:10] scap sync apparently synced wikiversions as well, so we are good [23:32:12] matt_flaschen, tgr is still deploying so swat is delayed [23:32:15] MaxSem: ^ [23:32:29] oh, so I can do the deed? [23:32:37] yeah [23:32:41] thanks [23:32:50] sorry for the delay [23:33:03] matt_flaschen, go ahead [23:33:17] RECOVERY - puppet last run on oxygen is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [23:33:22] I'll move my SWAT patches to another day to try to compensate a bit [23:33:35] Should still be time [23:33:51] (03PS1) 10Dzahn: RT: convert to role/profile-model [puppet] - 10https://gerrit.wikimedia.org/r/342771 [23:34:01] there are 8 patches though [23:34:37] Should be fine [23:35:40] matt_flaschen, if you can't do that right now, I can start scap [23:36:06] Why do you need to scap? :P [23:36:21] s/scap/swat/ [23:36:57] (03CR) 10MaxSem: [C: 032] Disable WikimediaEvents extension on closed wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342148 (https://phabricator.wikimedia.org/T158721) (owner: 10Krinkle) [23:37:31] (03CR) 10MaxSem: [C: 04-2] Disable WikimediaEvents extension on closed wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342148 (https://phabricator.wikimedia.org/T158721) (owner: 10Krinkle) [23:37:57] tgr, Undefined variable: wmgUsePageViewInfo in /srv/mediawiki/wmf-config/CommonSettings.php on line 3496 [23:38:24] Touch and sync IS? [23:39:52] MaxSem: probably a transient error? [23:40:16] InitialiseSettings and CommonSettings not being rsynced at exactly the same time, or something [23:40:26] full scap to deploy new extension config is likely gonna do that [23:40:35] If CS is synced before IS...
[23:40:46] yeah, initSettings should be file-synced ahead of time [23:40:58] Or done after the scap [23:41:09] As long as the extension is in extension-list, that's the main thing to get the messages sorted [23:41:24] look in logstash, it's not falling off [23:41:34] touch IS and sync [23:41:36] See if it does [23:41:59] it's waiting for something [23:42:08] a host is down? [23:42:10] grrrr [23:42:20] Yeah, one in codfw with memory errors [23:42:29] the variable is set on InitialiseSettings line 18306 [23:42:30] I noted on the install task that it needs removing from dsh [23:42:51] !log maxsem@tin Synchronized wmf-config/InitialiseSettings.php: - (duration: 02m 50s) [23:42:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:43:02] we really need something better than dsh groups+puppet for pooling/depooling [23:43:41] I think j.oe has a plan for that [23:44:17] no more deployments! [23:45:07] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 20 probes of 276 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [23:45:07] okay, resolved [23:45:46] (03CR) 10MaxSem: [C: 032] Disable WikimediaEvents extension on closed wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342148 (https://phabricator.wikimedia.org/T158721) (owner: 10Krinkle) [23:46:07] MaxSem: Same for this one (IS first, then CS) - no scap though [23:46:13] yeah [23:46:28] (03CR) 10MaxSem: [V: 032 C: 032] Disable WikimediaEvents extension on closed wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342148 (https://phabricator.wikimedia.org/T158721) (owner: 10Krinkle) [23:46:35] I'll verify on debug when you're ready [23:46:44] (03PS5) 10MaxSem: Disable WikimediaEvents extension on closed wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342148 (https://phabricator.wikimedia.org/T158721) (owner: 10Krinkle) [23:46:51] grrr [23:47:00] (03CR) 10Dzahn: [C: 04-1] 
"http://puppet-compiler.wmflabs.org/5773/cobalt.wikimedia.org/change.cobalt.wikimedia.org.err" [puppet] - 10https://gerrit.wikimedia.org/r/342692 (owner: 10Dzahn) [23:47:46] (03CR) 10MaxSem: Disable WikimediaEvents extension on closed wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342148 (https://phabricator.wikimedia.org/T158721) (owner: 10Krinkle) [23:47:51] (03CR) 10MaxSem: [C: 032] Disable WikimediaEvents extension on closed wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342148 (https://phabricator.wikimedia.org/T158721) (owner: 10Krinkle) [23:48:10] (03CR) 10MaxSem: [V: 032 C: 032] Disable WikimediaEvents extension on closed wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342148 (https://phabricator.wikimedia.org/T158721) (owner: 10Krinkle) [23:48:28] (03CR) 10jenkins-bot: Disable WikimediaEvents extension on closed wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342148 (https://phabricator.wikimedia.org/T158721) (owner: 10Krinkle) [23:49:25] Krinkle, pulled on mwdebug1002 [23:50:07] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 15 probes of 276 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [23:51:00] MaxSem: OK verifying [23:51:57] lol [23:52:05] 5 \nResting in the Hammock \nMONDAY .... \nMrs. Bufklns Had a Busy Day \n. 36 \n [23:52:11] MaxSem: Verified. ext.wikimediaEvents still registered and loaded on open wikis. Not loaded and not registered on aa.wikipedia [23:52:27] awsum [23:53:49] hmm, do we still need to distinguish between sync-file and sync-dir? [23:55:13] MaxSem: Nope [23:55:17] (03PS4) 10Dzahn: gerrit: convert to profile/role structure [puppet] - 10https://gerrit.wikimedia.org/r/342692 [23:55:26] :D [23:55:28] They do the same thing internally. sync-dir is just a back-compat alias for sync-file [23:55:52] just sync? 
[23:55:59] !log maxsem@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/342148/ (duration: 02m 47s) [23:56:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:56:04] No, `scap sync-file` [23:56:09] For single files or dirs [23:56:26] `scap sync-dir` still works, but it's a hidden command so you won't see it in help [23:56:39] (03CR) 10Ladsgroup: "I have one major concern regarding using "testwiki" for "enwiki" in beta, otherwise look good to me." (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342767 (owner: 10Mattflaschen) [23:58:50] (03Abandoned) 10Chad: Add gigantic warning to files/ directory [puppet] - 10https://gerrit.wikimedia.org/r/342653 (owner: 10Chad) [23:59:38] !log maxsem@tin Synchronized wmf-config/CommonSettings.php: https://gerrit.wikimedia.org/r/#/c/342148/ (duration: 02m 47s) [23:59:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:59:49] Krinkle, please verify ^
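Note on the sync order discussed above ("IS first, then CS"): the transient "Undefined variable: wmgUsePageViewInfo" notice was attributed in the channel to CommonSettings.php reaching hosts before the InitialiseSettings.php that defines the variable. The sketch below only illustrates that ordering rule; it is not part of scap. The names `SYNC_PRIORITY`, `order_for_sync` and `sync_commands` are hypothetical, and the `scap sync-file <path> <message>` argument shape is assumed from the log lines above rather than taken from scap's documentation. It builds command strings and never runs them.

```python
import shlex

# Hypothetical priority table: files that define settings are synced
# before files that read them. Lower number means earlier.
SYNC_PRIORITY = {
    "wmf-config/InitialiseSettings.php": 0,  # defines the wmg* variables
    "wmf-config/CommonSettings.php": 1,      # reads the wmg* variables
}


def order_for_sync(paths):
    """Return paths with known definers first.

    Files not in SYNC_PRIORITY go last; sorted() is stable, so they
    keep their original relative order.
    """
    fallback = len(SYNC_PRIORITY)
    return sorted(paths, key=lambda p: SYNC_PRIORITY.get(p, fallback))


def sync_commands(paths, message):
    """Build one command string per file, in safe order (assumed CLI shape)."""
    return [
        f"scap sync-file {shlex.quote(p)} {shlex.quote(message)}"
        for p in order_for_sync(paths)
    ]


if __name__ == "__main__":
    changed = [
        "wmf-config/CommonSettings.php",
        "wmf-config/InitialiseSettings.php",
    ]
    for cmd in sync_commands(changed, "T158721"):
        print(cmd)
```

Printing the commands instead of executing them keeps the deployer in the loop, which matches how the two files were synced one at a time in the log.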