[01:02:50] PROBLEM - Check health of redis instance on 6481 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 1499648562 600 - REDIS 2.8.17 on 127.0.0.1:6481 has 1 databases (db0) with 4625940 keys, up 2 minutes 39 seconds - replication_delay is 1499648562
[01:02:50] PROBLEM - Check health of redis instance on 6380 on rdb2003 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 127.0.0.1 on port 6380
[01:02:50] PROBLEM - Check health of redis instance on 6381 on rdb2003 is CRITICAL: CRITICAL: replication_delay is 1499648567 600 - REDIS 2.8.17 on 127.0.0.1:6381 has 1 databases (db0) with 9238208 keys, up 2 minutes 45 seconds - replication_delay is 1499648567
[01:03:00] PROBLEM - Check health of redis instance on 6379 on rdb2001 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 127.0.0.1 on port 6379
[01:03:40] RECOVERY - Check health of redis instance on 6481 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6481 has 1 databases (db0) with 4623930 keys, up 3 minutes 31 seconds - replication_delay is 0
[01:03:50] RECOVERY - Check health of redis instance on 6380 on rdb2003 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6380 has 1 databases (db0) with 9327959 keys, up 3 minutes 41 seconds - replication_delay is 0
[01:04:00] RECOVERY - Check health of redis instance on 6381 on rdb2003 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6381 has 1 databases (db0) with 9235359 keys, up 3 minutes 49 seconds - replication_delay is 0
[01:04:00] RECOVERY - Check health of redis instance on 6379 on rdb2001 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 9331493 keys, up 3 minutes 47 seconds - replication_delay is 0
[01:09:40] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0]
[01:10:40] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0]
[01:11:00] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0]
[01:12:40] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[01:15:40] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0]
[01:18:40] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[01:19:40] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[01:20:00] RECOVERY - Codfw HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[02:27:49] !log l10nupdate@tin scap failed: average error rate on 2/11 canaries increased by 10x (rerun with --force to override this check, see https://logstash.wikimedia.org/goto/3888cca979647b9381a7739b0bdbc88e for details)
[02:28:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:05:49] (CR) KartikMistry: "Alex, is this can be deploy somewhere and we can do testing before it goes to Production?" [debs/contenttranslation/cg3] - https://gerrit.wikimedia.org/r/362334 (https://phabricator.wikimedia.org/T168857) (owner: KartikMistry)
[04:03:00] PROBLEM - nova instance creation test on labnet1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args nova-fullstack
[04:08:40] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=767.40 Read Requests/Sec=4527.40 Write Requests/Sec=690.80 KBytes Read/Sec=29594.40 KBytes_Written/Sec=9622.40
[04:16:50] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=3.10 Read Requests/Sec=0.40 Write Requests/Sec=18.50 KBytes Read/Sec=2.40 KBytes_Written/Sec=137.20
[04:30:00] RECOVERY - nova instance creation test on labnet1001 is OK: PROCS OK: 1 process with command name python, args nova-fullstack
[05:47:54] (PS1) Marostegui: db-eqiad.php: Depool db1067 and db1080 [mediawiki-config] - https://gerrit.wikimedia.org/r/364145 (https://phabricator.wikimedia.org/T166204)
[05:55:44] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1067 and db1080 [mediawiki-config] - https://gerrit.wikimedia.org/r/364145 (https://phabricator.wikimedia.org/T166204) (owner: Marostegui)
[05:58:14] (Merged) jenkins-bot: db-eqiad.php: Depool db1067 and db1080 [mediawiki-config] - https://gerrit.wikimedia.org/r/364145 (https://phabricator.wikimedia.org/T166204) (owner: Marostegui)
[05:58:29] (CR) jenkins-bot: db-eqiad.php: Depool db1067 and db1080 [mediawiki-config] - https://gerrit.wikimedia.org/r/364145 (https://phabricator.wikimedia.org/T166204) (owner: Marostegui)
[05:59:21] !log marostegui@tin scap failed: average error rate on 1/11 canaries increased by 10x (rerun with --force to override this check, see https://logstash.wikimedia.org/goto/3888cca979647b9381a7739b0bdbc88e for details)
[05:59:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:00:34] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1080, depool db1067 - T166204 (duration: 00m 42s)
[06:00:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:00:44] T166204: Convert unique keys into primary keys for some wiki tables on s1 - https://phabricator.wikimedia.org/T166204
[06:02:00] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[06:02:00] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0]
[06:07:00] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[06:09:00] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[06:10:24] Operations, Graphite, User-fgiunchedi: Delete "servers" metrics in graphite older than 60d - https://phabricator.wikimedia.org/T169972#3419361 (Joe) Aren't we collecting all server metrics via prometheus? If that's the case, shouldn't we just drop the diamond collector for those metrics?
[06:11:49] !log Deploy alter table on s1 - db1080 and db1067 - T166204
[06:11:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:12:00] T166204: Convert unique keys into primary keys for some wiki tables on s1 - https://phabricator.wikimedia.org/T166204
[06:19:45] Operations, Deployment-Systems, Performance-Team, HHVM, Release-Engineering-Team (Next): Translation cache exhaustion caused by changes to PHP code in file scope - https://phabricator.wikimedia.org/T103886#3419367 (Joe) @Krinkle sure, we can enable reusing TC in beta for now and test if the f...
[06:23:31] (PS1) Giuseppe Lavagetto: deployment-prep: enable reusable TC on HHVM [puppet] - https://gerrit.wikimedia.org/r/364148 (https://phabricator.wikimedia.org/T103886)
[06:32:50] (PS1) Marostegui: mariadb: Create sanitarium3 role [puppet] - https://gerrit.wikimedia.org/r/364150
[06:35:43] Operations, Puppet, User-Joe: Prepare for Puppet 4 - https://phabricator.wikimedia.org/T169548#3419380 (Joe)
[06:35:45] Operations, Puppet, Patch-For-Review, User-Joe: Add support for directory environments to our puppet classes, production puppetmaster - https://phabricator.wikimedia.org/T169485#3419377 (Joe) Open>Resolved p:Triage>High a:Joe
[06:42:32] Operations, ops-codfw, Performance-Team, Thumbor, User-fgiunchedi: Rename mw2148 / mw2149 / mw2259 / mw2260 to thumbor200[1234] - https://phabricator.wikimedia.org/T168881#3379407 (Joe) I did deactivate those nodes, and me and filippo already added a "puppet node deactivate" to wmf-reimage, s...
[06:43:10] (CR) Ayounsi: [C: 2] Add diffscan module. [puppet] - https://gerrit.wikimedia.org/r/363296 (https://phabricator.wikimedia.org/T169624) (owner: Ayounsi)
[06:43:22] ACKNOWLEDGEMENT - nutcracker port on mw1228 is CRITICAL: Return code of 255 is out of bounds Giuseppe Lavagetto T168613
[06:43:43] (PS6) Ayounsi: Add diffscan module. [puppet] - https://gerrit.wikimedia.org/r/363296 (https://phabricator.wikimedia.org/T169624)
[06:47:37] (CR) Muehlenhoff: [C: -1] "Already merged a similar patch last week" [puppet] - https://gerrit.wikimedia.org/r/363884 (owner: Dzahn)
[06:48:48] ACKNOWLEDGEMENT - Host mw1196 is DOWN: PING CRITICAL - Packet loss = 100% Giuseppe Lavagetto Already inactive, will be decommissioned soon.
[06:52:04] (PS2) Marostegui: mariadb: Create sanitarium3 role [puppet] - https://gerrit.wikimedia.org/r/364150
[06:56:18] Operations, Wikimedia-General-or-Unknown: Icinga has httpauth on (not accessible for public) - https://phabricator.wikimedia.org/T62112#3419405 (akosiaris) >>! In T62112#3419066, @Luke081515 wrote: > Did something changed here in over two years since icinga is login-only? No, and given @MoritzMuehlenhof...
[06:58:40] Operations, fundraising-tech-ops: set up SSL cert monitoring for benefactorevents.wm.o - https://phabricator.wikimedia.org/T156850#2987822 (Joe) This thing is alerting since 4 days as it's apparently using the default azure ssl cert. I am RADICALLY AGAINST monitoring such certificates/hosts if we're not...
[07:13:51] !log reboot netmon1001 for kernel update
[07:14:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:18:11] ACKNOWLEDGEMENT - IPsec on kafka1012 is CRITICAL: Strongswan CRITICAL - ok: 116 connecting: cp3009_v4, cp3009_v6 Ema https://phabricator.wikimedia.org/T148422
[07:18:11] ACKNOWLEDGEMENT - IPsec on kafka1013 is CRITICAL: Strongswan CRITICAL - ok: 116 connecting: cp3009_v4, cp3009_v6 Ema https://phabricator.wikimedia.org/T148422
[07:18:11] ACKNOWLEDGEMENT - IPsec on kafka1014 is CRITICAL: Strongswan CRITICAL - ok: 116 connecting: cp3009_v4, cp3009_v6 Ema https://phabricator.wikimedia.org/T148422
[07:18:11] ACKNOWLEDGEMENT - IPsec on kafka1018 is CRITICAL: Strongswan CRITICAL - ok: 116 connecting: cp3009_v4, cp3009_v6 Ema https://phabricator.wikimedia.org/T148422
[07:18:11] ACKNOWLEDGEMENT - IPsec on kafka1020 is CRITICAL: Strongswan CRITICAL - ok: 116 connecting: cp3009_v4, cp3009_v6 Ema https://phabricator.wikimedia.org/T148422
[07:18:11] ACKNOWLEDGEMENT - IPsec on kafka1022 is CRITICAL: Strongswan CRITICAL - ok: 116 connecting: cp3009_v4, cp3009_v6 Ema https://phabricator.wikimedia.org/T148422
[07:18:47] !log oblivian@puppetmaster1001 conftool action : set/pooled=inactive; selector: name=cp3009.*
[07:18:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:19:05] (PS1) Giuseppe Lavagetto: icinga::monitor::certs: remove eventdonations [puppet] - https://gerrit.wikimedia.org/r/364158
[07:19:18] (Abandoned) Marostegui: mariadb: Create sanitarium3 role [puppet] - https://gerrit.wikimedia.org/r/364150 (owner: Marostegui)
[07:21:37] (CR) Giuseppe Lavagetto: [C: 2] icinga::monitor::certs: remove eventdonations [puppet] - https://gerrit.wikimedia.org/r/364158 (owner: Giuseppe Lavagetto)
[07:25:28] (Restored) Marostegui: mariadb: Create sanitarium3 role [puppet] - https://gerrit.wikimedia.org/r/364150 (owner: Marostegui)
[07:35:19] (Abandoned) Marostegui: mariadb: Create sanitarium3 role [puppet] - https://gerrit.wikimedia.org/r/364150 (owner: Marostegui)
[07:36:26] (PS1) Marostegui: mariadb: Create sanitarium3 role [puppet] - https://gerrit.wikimedia.org/r/364159
[07:37:11] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[07:38:11] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[07:38:16] (CR) jerkins-bot: [V: -1] mariadb: Create sanitarium3 role [puppet] - https://gerrit.wikimedia.org/r/364159 (owner: Marostegui)
[07:39:11] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[07:41:44] (PS1) Ayounsi: Diffscan tuning [puppet] - https://gerrit.wikimedia.org/r/364161 (https://phabricator.wikimedia.org/T169624)
[07:42:11] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[07:43:01] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 20 probes of 294 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[07:43:55] XioNoX: how's ulsfo looking network-wise? ^
[07:44:08] looking
[07:44:11] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[07:44:15] (PS2) Marostegui: mariadb: Create sanitarium3 role [puppet] - https://gerrit.wikimedia.org/r/364159
[07:44:46] (CR) Muehlenhoff: Diffscan tuning (2 comments) [puppet] - https://gerrit.wikimedia.org/r/364161 (https://phabricator.wikimedia.org/T169624) (owner: Ayounsi)
[07:45:11] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[07:48:01] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 14 probes of 294 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[07:49:29] (CR) Marostegui: "Puppet looks happy with this change and does what it is supposed to do: https://puppet-compiler.wmflabs.org/6979/" [puppet] - https://gerrit.wikimedia.org/r/364159 (owner: Marostegui)
[07:52:01] ema: can't see anything out of the ordinary, that v6 ripe alert look like a transient provider issue *somewhere*
[07:52:01] (PS3) Marostegui: mariadb: Create sanitarium3 role [puppet] - https://gerrit.wikimedia.org/r/364159
[07:52:46] (CR) Alexandros Kosiaris: [C: -1] "Minor comments inline. Setting those aside, overall LGTM and to PCC as well https://puppet-compiler.wmflabs.org/6977/" (4 comments) [puppet] - https://gerrit.wikimedia.org/r/363870 (owner: Andrew Bogott)
[07:53:05] XioNoX: ok, thanks for checking!
[07:53:29] !log rebooting hafnium for kernel update
[07:53:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:54:41] (PS4) Marostegui: mariadb: Create sanitarium3 role [puppet] - https://gerrit.wikimedia.org/r/364159
[07:59:33] (PS2) Ayounsi: Diffscan tuning [puppet] - https://gerrit.wikimedia.org/r/364161 (https://phabricator.wikimedia.org/T169624)
[08:00:38] (CR) Ayounsi: Diffscan tuning (2 comments) [puppet] - https://gerrit.wikimedia.org/r/364161 (https://phabricator.wikimedia.org/T169624) (owner: Ayounsi)
[08:01:53] (CR) Muehlenhoff: [C: 1] Diffscan tuning [puppet] - https://gerrit.wikimedia.org/r/364161 (https://phabricator.wikimedia.org/T169624) (owner: Ayounsi)
[08:02:20] ema: https://phabricator.wikimedia.org/T167689 will allow us to have better visibility on those ripe alerts
[08:03:07] (CR) Ayounsi: [C: 2] Diffscan tuning [puppet] - https://gerrit.wikimedia.org/r/364161 (https://phabricator.wikimedia.org/T169624) (owner: Ayounsi)
[08:03:50] !log Drop database l10nwiki on s2 - T119811
[08:04:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:04:01] T119811: Drop localisation and localisation_file_hash tables, l10nwiki databases too - https://phabricator.wikimedia.org/T119811
[08:06:51] (PS3) Alexandros Kosiaris: Adds ORES grafana alert to icinga config [puppet] - https://gerrit.wikimedia.org/r/363890 (https://phabricator.wikimedia.org/T167830) (owner: Halfak)
[08:06:55] (CR) Alexandros Kosiaris: [C: 2] Adds ORES grafana alert to icinga config [puppet] - https://gerrit.wikimedia.org/r/363890 (https://phabricator.wikimedia.org/T167830) (owner: Halfak)
[08:06:59] (CR) Alexandros Kosiaris: [V: 2 C: 2] Adds ORES grafana alert to icinga config [puppet] - https://gerrit.wikimedia.org/r/363890 (https://phabricator.wikimedia.org/T167830) (owner: Halfak)
[08:08:09] (PS1) Marostegui: db-eqiad.php: Depool db1097 [mediawiki-config] - https://gerrit.wikimedia.org/r/364164 (https://phabricator.wikimedia.org/T168661)
[08:08:32] (PS3) Alexandros Kosiaris: grafana: Fix broken "Total requests per minute" panel [puppet] - https://gerrit.wikimedia.org/r/363882 (owner: Krinkle)
[08:08:37] (CR) Alexandros Kosiaris: [V: 2 C: 2] grafana: Fix broken "Total requests per minute" panel [puppet] - https://gerrit.wikimedia.org/r/363882 (owner: Krinkle)
[08:10:03] Operations, DBA, Traffic, WMF-Legal, and 2 others: dbtree loads third party resources (from jquery.com and google.com) - https://phabricator.wikimedia.org/T96499#3419578 (Aklapper) a:Springle>None [ Resetting assignee as assignee account is not active anymore ]
[08:12:46] (PS4) Elukey: use 'require_package' for stats packages including python-yaml [puppet] - https://gerrit.wikimedia.org/r/363382 (owner: ArielGlenn)
[08:12:53] (PS1) Alexandros Kosiaris: backup: Remove the day from the hourly schedule [puppet] - https://gerrit.wikimedia.org/r/364165
[08:13:17] (CR) Alexandros Kosiaris: [V: 2 C: 2] backup: Remove the day from the hourly schedule [puppet] - https://gerrit.wikimedia.org/r/364165 (owner: Alexandros Kosiaris)
[08:13:49] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1097 [mediawiki-config] - https://gerrit.wikimedia.org/r/364164 (https://phabricator.wikimedia.org/T168661) (owner: Marostegui)
[08:15:15] (Abandoned) Ema: Depool ulsfo [dns] - https://gerrit.wikimedia.org/r/363597 (owner: Ema)
[08:15:17] (PS3) Muehlenhoff: Move ferm service out of service::uwsgi [puppet] - https://gerrit.wikimedia.org/r/363595
[08:15:19] (Merged) jenkins-bot: db-eqiad.php: Depool db1097 [mediawiki-config] - https://gerrit.wikimedia.org/r/364164 (https://phabricator.wikimedia.org/T168661) (owner: Marostegui)
[08:16:09] (CR) jenkins-bot: db-eqiad.php: Depool db1097 [mediawiki-config] - https://gerrit.wikimedia.org/r/364164 (https://phabricator.wikimedia.org/T168661) (owner: Marostegui)
[08:16:14] !log marostegui@tin scap failed: average error rate on 2/11 canaries increased by 10x (rerun with --force to override this check, see https://logstash.wikimedia.org/goto/3888cca979647b9381a7739b0bdbc88e for details)
[08:16:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:17:10] (PS5) Elukey: use 'require_package' for stats packages including python-yaml [puppet] - https://gerrit.wikimedia.org/r/363382 (owner: ArielGlenn)
[08:17:19] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1097 - T168661 (duration: 00m 46s)
[08:17:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:17:31] T168661: Apply schema change to add 3D filetype for STL files - https://phabricator.wikimedia.org/T168661
[08:17:37] !log Deploy alter table on db1097 - T168661
[08:17:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:19:58] (CR) Elukey: [C: 2] use 'require_package' for stats packages including python-yaml [puppet] - https://gerrit.wikimedia.org/r/363382 (owner: ArielGlenn)
[08:23:11] RECOVERY - puppet last run on stat1003 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures
[08:25:41] (CR) Volans: "@gehel: thanks for the review, reply inline" (1 comment) [software/cumin] - https://gerrit.wikimedia.org/r/363747 (https://phabricator.wikimedia.org/T169640) (owner: Volans)
[08:28:56] ACKNOWLEDGEMENT - Router interfaces on mr1-codfw is CRITICAL: CRITICAL: host 208.80.153.196, interfaces up: 35, down: 1, dormant: 0, excluded: 0, unused: 0 ge-0/0/2: down - Cust: Airport Express WiFi AP Ayounsi T86541
[08:29:56] !log rebooting francium for kernel update
[08:30:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:35:14] !log rebooting ms1001 for kernel update
[08:35:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:41:26] Operations, Goal, Kubernetes: Streamlined Service Delivery - https://phabricator.wikimedia.org/T170108#3419676 (akosiaris)
[08:48:14] Operations, Discovery, Maps, Traffic, and 2 others: What is a reasonable per-IP ratelimit for maps - https://phabricator.wikimedia.org/T169175#3419717 (ema) So, @mpopov's analysis seems to be based on all requests, including varnish cache hits. On the cache_text cluster, we're currently rate-limi...
[08:51:43] Operations, Goal, Kubernetes: Implement a pod networking policy approach - https://phabricator.wikimedia.org/T170111#3419728 (akosiaris)
[08:52:51] !log rebooting mwlog2001 for kernel update
[08:53:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:54:01] (PS6) Giuseppe Lavagetto: Generalize state management, allow multiple run modes [software/puppet-compiler] - https://gerrit.wikimedia.org/r/363217
[09:00:39] !log rebooting mw1168 (video scaler) for kernel update
[09:00:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:04:07] Operations, Goal, Kubernetes: Upgrade to kubernetes >=1.5 - https://phabricator.wikimedia.org/T170119#3419847 (akosiaris)
[09:04:30] Operations, Goal, Kubernetes: Streamlined Service Delivery - https://phabricator.wikimedia.org/T170108#3419676 (akosiaris) p:Triage>Normal
[09:06:16] Operations, vm-requests, Patch-For-Review: Site: 2 VM request for tendril (switch tendril from einsteinium to dbmonitor*) - https://phabricator.wikimedia.org/T149557#3419864 (jcrespo) No yet, I have not deleted the files on einsteinium.
[09:10:15] Operations, Goal, Kubernetes: Standardize on the "default" pod setup - https://phabricator.wikimedia.org/T170120#3419870 (akosiaris)
[09:10:41] !log Compress innodb on wikidata on dbstore2001
[09:10:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:12:42] (PS3) Addshore: WMDE Summer campaign - Add logging [mediawiki-config] - https://gerrit.wikimedia.org/r/362380 (https://phabricator.wikimedia.org/T168631)
[09:13:56] (PS4) Addshore: WMDE Summer campaign - Add logging [mediawiki-config] - https://gerrit.wikimedia.org/r/362380 (https://phabricator.wikimedia.org/T168631)
[09:17:23] Operations, Goal, Kubernetes: Experiment with ingress solutions (stretch) - https://phabricator.wikimedia.org/T170121#3419889 (akosiaris)
[09:18:59] (PS4) Muehlenhoff: Move ferm service out of service::uwsgi [puppet] - https://gerrit.wikimedia.org/r/363595
[09:21:50] Operations, Goal, Kubernetes: Experiment with ingress solutions (stretch) - https://phabricator.wikimedia.org/T170121#3419905 (akosiaris)
[09:22:37] Operations, Goal, Kubernetes: Standardize on the "default" pod setup - https://phabricator.wikimedia.org/T170120#3419908 (akosiaris)
[09:22:43] (CR) Muehlenhoff: [C: 2] Move ferm service out of service::uwsgi [puppet] - https://gerrit.wikimedia.org/r/363595 (owner: Muehlenhoff)
[09:31:14] (PS1) Muehlenhoff: Remove ferm service for striker / 8081 [puppet] - https://gerrit.wikimedia.org/r/364174 (https://phabricator.wikimedia.org/T169070)
[09:48:39] jouncebot next
[09:48:39] In 0 hour(s) and 11 minute(s): WMDE Summer campaign (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170710T1000)
[09:50:00] (CR) Addshore: [C: 2] WMDE Summer campaign - Add logging [mediawiki-config] - https://gerrit.wikimedia.org/r/362380 (https://phabricator.wikimedia.org/T168631) (owner: Addshore)
[09:50:21] PROBLEM - Number of backend failures per minute from CirrusSearch on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [600.0]
[09:51:05] looking ^
[09:51:07] (Merged) jenkins-bot: WMDE Summer campaign - Add logging [mediawiki-config] - https://gerrit.wikimedia.org/r/362380 (https://phabricator.wikimedia.org/T168631) (owner: Addshore)
[09:51:19] (CR) jenkins-bot: WMDE Summer campaign - Add logging [mediawiki-config] - https://gerrit.wikimedia.org/r/362380 (https://phabricator.wikimedia.org/T168631) (owner: Addshore)
[09:52:21] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0]
[09:52:47] dcausse --^
[09:52:54] ah sorry :D
[09:53:00] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1097" [mediawiki-config] - https://gerrit.wikimedia.org/r/364179
[09:53:02] (PS2) Marostegui: Revert "db-eqiad.php: Depool db1097" [mediawiki-config] - https://gerrit.wikimedia.org/r/364179
[09:53:03] cirrussearch errors was a huge load spike (T169498), load is already going down
[09:53:04] T169498: Investigate load spikes on the elasticsearch cluster in eqiad - https://phabricator.wikimedia.org/T169498
[09:53:10] (PS1) Jcrespo: mariadb: Test new multiinstance dbstore role on an empty host [puppet] - https://gerrit.wikimedia.org/r/364180 (https://phabricator.wikimedia.org/T169514)
[09:54:02] (CR) Marostegui: [C: 1] mariadb: Test new multiinstance dbstore role on an empty host [puppet] - https://gerrit.wikimedia.org/r/364180 (https://phabricator.wikimedia.org/T169514) (owner: Jcrespo)
[09:54:09] will try to deploy something this afternoon (we have some ideas to try but not yet 100% confident it'll fix this recurring problem)
[09:54:46] !log addshore@tin Synchronized wmf-config/InitialiseSettings.php: [[gerrit:362380|WMDE Summer campaign - Add logging]] (duration: 00m 45s)
[09:54:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:56:35] anomie & maxsem, git log HEAD..origin/wmf/1.30.0-wmf.7 is showing 3 commits by you on tin
[09:57:21] RECOVERY - Number of backend failures per minute from CirrusSearch on graphite1001 is OK: OK: Less than 20.00% above the threshold [300.0]
[09:57:21] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1001 is OK: OK: Less than 20.00% above the threshold [500.0]
[09:58:40] (CR) Jcrespo: [C: -1] "Errors: https://puppet-compiler.wmflabs.org/6982/db1096.eqiad.wmnet/change.db1096.eqiad.wmnet.err" [puppet] - https://gerrit.wikimedia.org/r/364180 (https://phabricator.wikimedia.org/T169514) (owner: Jcrespo)
[09:59:10] !log rebooting mc2* servers for kernel update
[09:59:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:00:06] addshore: Dear anthropoid, the time has come. Please deploy WMDE Summer campaign (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170710T1000).
[10:00:06] addshore: A patch you scheduled for WMDE Summer campaign is about to be deployed. Please be available during the process.
[10:00:11] o/
[10:02:38] (PS2) Jcrespo: mariadb: Test new multiinstance dbstore role on an empty host [puppet] - https://gerrit.wikimedia.org/r/364180 (https://phabricator.wikimedia.org/T169514)
[10:03:57] legoktm: looks like you didnt rebase when you deployed these things on friday afaik
[10:07:41] mhhhm
[10:10:52] Guess I'll revert the 3 of them for now then
[10:13:44] !log reverting https://gerrit.wikimedia.org/r/#/c/363891 as it is sitting on tin undeployed T169261
[10:13:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:13:55] T169261: Users unable to remain logged in, associated with attempts to upgrade the password hash on every login - https://phabricator.wikimedia.org/T169261
[10:14:30] On a scale of nope to 10, how easy would it be for someone to rename my Phab account?
[10:14:39] (PS1) DCausse: [cirrus] Enable the token_count_router only for chinese [mediawiki-config] - https://gerrit.wikimedia.org/r/364183 (https://phabricator.wikimedia.org/T169498)
[10:15:22] (from memory phab doesn't recommend deleting accounts so I'm not sure how a rename would affect it)
[10:21:07] !log addshore@tin Synchronized php-1.30.0-wmf.7/extensions/CentralAuth: CentralAuth (undeployed patches) [[gerrit:363892]], [[gerrit:363893]], [[gerrit:363891]] & revert [[gerrit:364182]] T169261 (duration: 00m 47s)
[10:21:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:21:18] T169261: Users unable to remain logged in, associated with attempts to upgrade the password hash on every login - https://phabricator.wikimedia.org/T169261
[10:22:52] !log addshore@tin Synchronized php-1.30.0-wmf.7/extensions/WikimediaEvents: [[gerrit:364172|WMDE Summer campaign - Add hook]] (duration: 00m 42s)
[10:23:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:25:05] (PS1) Addshore: Fix spaces ot tabs for WMDE log line [mediawiki-config] - https://gerrit.wikimedia.org/r/364184
[10:25:13] (CR) Addshore: [C: 2] Fix spaces ot tabs for WMDE log line [mediawiki-config] - https://gerrit.wikimedia.org/r/364184 (owner: Addshore)
[10:25:40] (PS2) Addshore: Fix spaces to tabs for WMDE log line [mediawiki-config] - https://gerrit.wikimedia.org/r/364184
[10:25:45] (CR) Addshore: [C: 2] Fix spaces to tabs for WMDE log line [mediawiki-config] - https://gerrit.wikimedia.org/r/364184 (owner: Addshore)
[10:26:53] (Merged) jenkins-bot: Fix spaces to tabs for WMDE log line [mediawiki-config] - https://gerrit.wikimedia.org/r/364184 (owner: Addshore)
[10:27:02] (CR) jenkins-bot: Fix spaces to tabs for WMDE log line [mediawiki-config] - https://gerrit.wikimedia.org/r/364184 (owner: Addshore)
[10:27:51] (PS1) Muehlenhoff: Restrict ores::web to domain networks [puppet] - https://gerrit.wikimedia.org/r/364185
[10:28:12] !log addshore@tin Synchronized wmf-config/InitialiseSettings.php: [[gerrit:364184|WMDE Summer campaign - Add logging]] (fix spacing) NOOP (duration: 00m 43s)
[10:28:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:28:29] !log WMDE Summer campaign deploy slot DONE
[10:28:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:32:39] (CR) Alexandros Kosiaris: [C: 1] Restrict ores::web to domain networks [puppet] - https://gerrit.wikimedia.org/r/364185 (owner: Muehlenhoff)
[10:34:11] PROBLEM - IPsec on mc1023 is CRITICAL: Strongswan CRITICAL - ok: 0 not-conn: mc2023_v4
[10:37:38] ^ mc2023 reboot was stuck, should recover soon
[10:38:29] Operations, MW-1.30-release-notes, Wikimedia-Language-setup, Wikimedia-Site-requests, Patch-For-Review: Create Atikamekw Wikipedia - https://phabricator.wikimedia.org/T167714#3345502 (Urbanecm) Is this done now or is there anything to do ATM?
[10:43:53] Operations, Wikimedia-Language-setup, Wikimedia-Site-requests, Patch-For-Review, User-Urbanecm: Create Dinka Wikipedia - https://phabricator.wikimedia.org/T168518#3420142 (Esc3300)
[10:47:51] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0]
[10:49:21] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0]
[10:49:51] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[10:52:51] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0]
[10:55:03] Operations, Traffic, netops: Recurring varnish-be fetch failures in codfw - https://phabricator.wikimedia.org/T170131#3420148 (ema)
[10:55:13] Operations, Traffic, netops: Recurring varnish-be fetch failures in codfw - https://phabricator.wikimedia.org/T170131#3420163 (ema) p:Triage>High
[10:55:21] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[10:56:51] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[10:57:21] RECOVERY - IPsec on mc1023 is OK: Strongswan OK - 1 ESP OK
[11:03:24] Operations, Collection, OfflineContentGenerator, Reading-Community-Engagement, and 2 others: Replace OCG in collection extension with Electron - https://phabricator.wikimedia.org/T150872#3420193 (Tgr)
[11:08:11] PROBLEM - IPsec on mc1025 is CRITICAL: Strongswan CRITICAL - ok: 0 not-conn: mc2025_v4
[11:13:35] !log kartik@tin Started deploy [cxserver/deploy@c209bec]: Update cxserver to 3375da5
[11:13:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:13:59] Operations, OfflineContentGenerator, Reading-Community-Engagement, Patch-For-Review, and 2 others: Collate wikimedia pages into a single html wikimedia page that can then be rendered into a single pdf - https://phabricator.wikimedia.org/T150874#3420207 (Tgr) a:Tgr>None See patch above. Ne...
[11:14:11] RECOVERY - IPsec on mc1025 is OK: Strongswan OK - 1 ESP OK
[11:15:16] Operations, OfflineContentGenerator, Reading-Community-Engagement, Patch-For-Review, and 2 others: Collate wikimedia pages into a single html wikimedia page that can then be rendered into a single pdf - https://phabricator.wikimedia.org/T150874#3420211 (Tgr) Sample files can be seen in T168871 an...
[11:16:25] !log kartik@tin Finished deploy [cxserver/deploy@c209bec]: Update cxserver to 3375da5 (duration: 02m 49s)
[11:16:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:16:45] !log installing libgcrypt and expat security updates
[11:16:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:19:20] Operations, Collection, OfflineContentGenerator, Reading-Community-Engagement, and 2 others: Replace OCG in collection extension with Electron - https://phabricator.wikimedia.org/T150872#3420222 (Tgr) a:Tgr>None Work on this happened in subtasks: * {T168004} * {T168871} * {T169897} See a...
[11:35:03] (03PS3) 10Marostegui: Revert "db-eqiad.php: Depool db1097" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364179 [11:36:13] !log Stop MySQL on db1102 for maintenance - T153743 [11:36:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:36:23] T153743: Add and sanitize s2, s4, s5, s6 and s7 to sanitarium2 and new labsdb hosts - https://phabricator.wikimedia.org/T153743 [11:37:25] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1097" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364179 (owner: 10Marostegui) [11:38:34] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1097" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364179 (owner: 10Marostegui) [11:38:42] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1097" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364179 (owner: 10Marostegui) [11:39:04] (03CR) 10Marostegui: [C: 032] mariadb: Create sanitarium3 role [puppet] - 10https://gerrit.wikimedia.org/r/364159 (owner: 10Marostegui) [11:39:11] (03PS5) 10Marostegui: mariadb: Create sanitarium3 role [puppet] - 10https://gerrit.wikimedia.org/r/364159 [11:39:33] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Restore original weight for db1097 - T168661 (duration: 00m 42s) [11:39:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:39:44] T168661: Apply schema change to add 3D filetype for STL files - https://phabricator.wikimedia.org/T168661 [11:41:49] (03Abandoned) 10Urbanecm: Fix rate limit configuration for plwiki - ratelimit thanks-notification [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363011 (https://phabricator.wikimedia.org/T169268) (owner: 10Urbanecm) [11:46:10] (03CR) 10MarcoAurelio: "> Also, shouldn't we make some unit tests?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/363370 (https://phabricator.wikimedia.org/T168727) (owner: 10MarcoAurelio) [11:52:32] jouncebot: next [11:52:32] In 1 hour(s) and 7 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170710T1300) [12:02:20] !log installing bind security updates (we only have client libs/tools installed) [12:02:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:02:54] !log Upgrade db1102 to 10.1 and enable rbr triggers - T153743 [12:03:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:03:04] T153743: Add and sanitize s2, s4, s5, s6 and s7 to sanitarium2 and new labsdb hosts - https://phabricator.wikimedia.org/T153743 [12:05:16] (03PS1) 10Marostegui: db-eqiad.php: Depool db1091 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364189 (https://phabricator.wikimedia.org/T168661) [12:07:01] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1091 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364189 (https://phabricator.wikimedia.org/T168661) (owner: 10Marostegui) [12:08:19] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1091 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364189 (https://phabricator.wikimedia.org/T168661) (owner: 10Marostegui) [12:08:21] PROBLEM - nova instance creation test on labnet1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args nova-fullstack [12:08:28] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1091 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364189 (https://phabricator.wikimedia.org/T168661) (owner: 10Marostegui) [12:08:57] (03PS2) 10Muehlenhoff: Reimage mw2118 with jessie [puppet] - 10https://gerrit.wikimedia.org/r/358943 [12:10:57] (03CR) 10Muehlenhoff: [C: 032] Reimage mw2118 with jessie [puppet] - 10https://gerrit.wikimedia.org/r/358943 (owner: 10Muehlenhoff) [12:11:25] !log marostegui@tin Synchronized 
wmf-config/db-eqiad.php: Depool db1091 - T168661 (duration: 00m 42s) [12:11:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:36] T168661: Apply schema change to add 3D filetype for STL files - https://phabricator.wikimedia.org/T168661 [12:12:11] PROBLEM - DPKG on thumbor1003 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [12:12:22] !log Deploy alter table on db1091 - T168661 [12:12:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:13:11] RECOVERY - DPKG on thumbor1003 is OK: All packages OK [12:18:17] (03PS2) 10Marostegui: db-eqiad.php: db1079 as sanitarium3 master for s7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363815 (https://phabricator.wikimedia.org/T153743) [12:19:38] (03CR) 10Marostegui: [C: 032] db-eqiad.php: db1079 as sanitarium3 master for s7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363815 (https://phabricator.wikimedia.org/T153743) (owner: 10Marostegui) [12:19:46] 10Operations, 10DBA, 10MediaWiki-extensions-WikibaseClient, 10Performance-Team, and 6 others: Cache invalidations coming from the JobQueue are causing lag on several wikis - https://phabricator.wikimedia.org/T164173#3420358 (10daniel) @aaron I'm trying to wrap my head around the code in Title::invalidateCa... 
[12:20:48] (03Merged) 10jenkins-bot: db-eqiad.php: db1079 as sanitarium3 master for s7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363815 (https://phabricator.wikimedia.org/T153743) (owner: 10Marostegui) [12:20:50] (03CR) 10jenkins-bot: db-eqiad.php: db1079 as sanitarium3 master for s7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363815 (https://phabricator.wikimedia.org/T153743) (owner: 10Marostegui) [12:21:51] 10Operations, 10ops-eqiad, 10netops: Replace cr1/2-eqiad air filters - https://phabricator.wikimedia.org/T170138#3420362 (10faidon) [12:22:00] !log marostegui@tin scap failed: average error rate on 4/11 canaries increased by 10x (rerun with --force to override this check, see https://logstash.wikimedia.org/goto/3888cca979647b9381a7739b0bdbc88e for details) [12:22:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:22:49] !log installing xorg-server security updates [12:22:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:23:10] !log marostegui@tin scap failed: average error rate on 2/11 canaries increased by 10x (rerun with --force to override this check, see https://logstash.wikimedia.org/goto/3888cca979647b9381a7739b0bdbc88e for details) [12:23:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:24:21] 10Operations, 10DBA, 10MediaWiki-extensions-WikibaseClient, 10Performance-Team, and 6 others: Cache invalidations coming from the JobQueue are causing lag on several wikis - https://phabricator.wikimedia.org/T164173#3420378 (10daniel) @jcrespo Looking at T169884, it seems that it's an unrelated issue trigg... 
[12:26:11] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: db1079 to become master for sanitarium3 - T153743 (duration: 00m 41s) [12:26:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:26:23] T153743: Add and sanitize s2, s4, s5, s6 and s7 to sanitarium2 and new labsdb hosts - https://phabricator.wikimedia.org/T153743 [12:37:31] 10Operations, 10monitoring: setup netmon1002.wikimedia.org - https://phabricator.wikimedia.org/T159756#3420393 (10faidon) [12:38:45] 10Operations, 10monitoring, 10Technical-Debt: Retire Torrus - https://phabricator.wikimedia.org/T87840#3420397 (10faidon) [12:40:47] 10Operations, 10ops-codfw, 10ops-eqiad, 10monitoring: Unresponsive/misconfigured iDRACs - https://phabricator.wikimedia.org/T169360#3420400 (10faidon) [12:44:56] 10Operations, 10monitoring, 10Patch-For-Review: check_hpssacli should report on battery failures and cache disabled - https://phabricator.wikimedia.org/T163998#3420403 (10faidon) 05Open>03Resolved a:03faidon This has been fixed for a while. [12:49:17] 10Operations, 10monitoring, 10Patch-For-Review: Monitor hardware thermal issues - https://phabricator.wikimedia.org/T125205#3420415 (10faidon) [12:49:19] 10Operations, 10monitoring: labvirt1008/labsdb1001: FreeIPMI returned an empty header map - https://phabricator.wikimedia.org/T167138#3420408 (10faidon) 05Open>03Resolved a:03faidon labsdb1001 is one of two Ciscos remaining in our fleet (labsdb1003 being the other one). They're old and their BIOS/IPMI im... 
[12:52:24] (03PS1) 10Elukey: role::piwik::server: add regular bacula backups [puppet] - 10https://gerrit.wikimedia.org/r/364195 (https://phabricator.wikimedia.org/T164073) [12:54:38] (03PS1) 10Elukey: Add fake Piwik backup user/password [labs/private] - 10https://gerrit.wikimedia.org/r/364196 [12:55:08] (03CR) 10Elukey: [V: 032 C: 032] Add fake Piwik backup user/password [labs/private] - 10https://gerrit.wikimedia.org/r/364196 (owner: 10Elukey) [12:55:15] 10Operations, 10monitoring, 10Patch-For-Review: Several hosts return "internal IPMI error" in the check_ipmi_temp check - https://phabricator.wikimedia.org/T167121#3420420 (10faidon) [12:55:17] 10Operations, 10monitoring, 10Patch-For-Review: Monitor hardware thermal issues - https://phabricator.wikimedia.org/T125205#3420426 (10faidon) [12:55:19] 10Operations, 10monitoring, 10Patch-For-Review: Several hosts return "internal IPMI error" in the check_ipmi_temp check - https://phabricator.wikimedia.org/T167121#3318229 (10faidon) 05Open>03Resolved a:03faidon I just checked the list above one by one. All of them work now, with the exception of sodiu... [12:57:21] 10Operations: remove icinga monitoring for benefactorevents.wm.o SSL certificate - https://phabricator.wikimedia.org/T170139#3420428 (10Jgreen) [12:57:40] 10Operations: remove icinga monitoring for benefactorevents.wm.o SSL certificate - https://phabricator.wikimedia.org/T170139#3420440 (10Jgreen) [12:57:59] 10Operations, 10monitoring: Monitor hardware thermal issues - https://phabricator.wikimedia.org/T125205#3420442 (10faidon) p:05High>03Normal So the IPMI checks have been deployed for a while. Quite a few hosts had BMC issues (some of them are fixed), and it remains to be seen whether the IPMI checks are go... 
[12:58:29] !log Run redact_sanitarium on s2 and s6 - db1102 - T153743 [12:58:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:58:41] T153743: Add and sanitize s2, s4, s5, s6 and s7 to sanitarium2 and new labsdb hosts - https://phabricator.wikimedia.org/T153743 [12:59:31] 10Operations, 10monitoring: Check long-running screen/tmux sessions - https://phabricator.wikimedia.org/T165348#3420449 (10faidon) [13:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Respected human, time to deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170710T1300). Please do the needful. [13:00:04] dcausse, Urbanecm, and TabbyCat: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [13:00:22] o/ [13:00:48] 10Operations: revoke benefactorevents.wikimedia.org SSL certificate - https://phabricator.wikimedia.org/T170140#3420458 (10Jgreen) [13:01:13] o/ [13:01:27] (03CR) 10Elukey: "@Alex: I tried to follow your suggestions in T164073 and I came up with this version, https://puppet-compiler.wmflabs.org/6984/bohrium.eqi" [puppet] - 10https://gerrit.wikimedia.org/r/364195 (https://phabricator.wikimedia.org/T164073) (owner: 10Elukey) [13:02:00] 10Operations, 10MediaWiki-Debug-Logger, 10Wikimedia-Logstash, 10monitoring: MediaWiki logging & encryption - https://phabricator.wikimedia.org/T126989#3420472 (10faidon) [13:02:11] PROBLEM - Check Varnish expiry mailbox lag on cp1099 is CRITICAL: CRITICAL: expiry mailbox lag is 2041852 [13:02:17] o/ o/ [13:02:18] 10Operations, 10monitoring, 10User-fgiunchedi: encrypt syslog traffic - https://phabricator.wikimedia.org/T136312#3420474 (10faidon) [13:03:00] anybody wants to deploy their own changes? anybody wants to do the swat? 
[13:03:10] RainbowSprinkles wants to swat :P [13:03:44] zeljkof: I cannot deploy my own changes so if you want to swat those I'm okay with it [13:03:52] 10Operations, 10Graphite, 10monitoring, 10MW-1.27-release (WMF-deploy-2016-04-12_(1.27.0-wmf.21)), 10MW-1.27-release-notes: udp rcvbuferrors and inerrors on graphite1001 - https://phabricator.wikimedia.org/T101141#3420480 (10faidon) [13:04:04] !log milimetric@tin Started deploy [analytics/refinery@c22eb93]: Update Sqoop with better parallelism [13:04:08] RainbowSprinkles should be sleeping now, not deploying :) [13:04:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:04:15] heh [13:04:33] ok, in that case, for the record: I can SWAT today! [13:06:06] (03PS17) 10Mforns: Add white-list for EventLogging auto-purging [puppet] - 10https://gerrit.wikimedia.org/r/298721 (https://phabricator.wikimedia.org/T108850) [13:06:07] will start with dcausse's commit first, reviewing [13:06:20] 10Operations, 10monitoring, 10Graphite, 10User-fgiunchedi: Audit groups of metrics in Graphite that allocate a lot of disk space - https://phabricator.wikimedia.org/T1075#3420483 (10faidon) [13:06:58] !log milimetric@tin Finished deploy [analytics/refinery@c22eb93]: Update Sqoop with better parallelism (duration: 02m 54s) [13:07:02] 10Operations, 10monitoring, 10Prometheus-metrics-monitoring, 10User-fgiunchedi: Effects on adjusting Prometheus retention - https://phabricator.wikimedia.org/T160677#3420497 (10faidon) [13:07:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:07:44] 10Operations, 10DBA, 10monitoring, 10Prometheus-metrics-monitoring: Create a script to regenerate prometheus mysqld exporter listing that works with puppetdb - https://phabricator.wikimedia.org/T145072#3420498 (10faidon) [13:08:08] 10Operations, 10DBA, 10monitoring, 10Patch-For-Review, 10Prometheus-metrics-monitoring: MySQL monitoring with prometheus - https://phabricator.wikimedia.org/T143896#3420499 
(10faidon) [13:08:16] zeljkof: I'm here for SWAT now [13:08:38] 10Operations, 10monitoring, 10Prometheus-metrics-monitoring: Move prometheus entry point off port 80 - https://phabricator.wikimedia.org/T152445#3420502 (10faidon) [13:08:51] Urbanecm: ok, you are next in queue [13:08:54] 10Operations, 10Traffic: revoke benefactorevents.wikimedia.org SSL certificate - https://phabricator.wikimedia.org/T170140#3420503 (10Peachey88) [13:09:03] zeljkof: Ok, waiting for your ping. [13:09:26] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364183 (https://phabricator.wikimedia.org/T169498) (owner: 10DCausse) [13:09:32] PROBLEM - Apache HTTP on mw2216 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:10:18] swat might be a bit slower today, for a couple of reasons: #1 I am slow in general, #2 my network is working but a bit slow today [13:10:21] RECOVERY - Apache HTTP on mw2216 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.132 second response time [13:10:38] 10Operations, 10monitoring: Fix Icinga checks for test/decom servers - https://phabricator.wikimedia.org/T151632#3420517 (10faidon) What's left to be done here, @Dzahn? [13:10:40] (03Merged) 10jenkins-bot: [cirrus] Enable the token_count_router only for chinese [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364183 (https://phabricator.wikimedia.org/T169498) (owner: 10DCausse) [13:10:49] (03CR) 10jenkins-bot: [cirrus] Enable the token_count_router only for chinese [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364183 (https://phabricator.wikimedia.org/T169498) (owner: 10DCausse) [13:12:48] 10Operations, 10Icinga, 10monitoring: remove icinga monitoring for benefactorevents.wm.o SSL certificate - https://phabricator.wikimedia.org/T170139#3420521 (10Peachey88) [13:14:31] dcausse: can you test 364183 at mwdebug? 
[13:14:39] zeljkof: yes [13:14:54] dcausse: will be there in a few seconds, will ping you [13:15:02] o [13:15:05] k [13:15:14] (03PS25) 10Mforns: role::mariadb::analytics::custom_repl_slave: add eventlogging_cleaner.py [puppet] - 10https://gerrit.wikimedia.org/r/356383 (https://phabricator.wikimedia.org/T108850) (owner: 10Elukey) [13:15:24] 10Operations, 10Traffic: revoke benefactorevents.wikimedia.org SSL certificate - https://phabricator.wikimedia.org/T170140#3420538 (10Jgreen) [13:15:26] 10Operations, 10Icinga, 10monitoring: remove icinga monitoring for benefactorevents.wm.o SSL certificate - https://phabricator.wikimedia.org/T170139#3420539 (10Jgreen) [13:15:28] 10Operations, 10Fundraising-Backlog, 10Technical-Debt: Determine if benefactorevents.wikimedia.org should be hosted on the production cluster or still on Microsoft Azure - https://phabricator.wikimedia.org/T166240#3420540 (10Jgreen) [13:15:31] 10Operations, 10Traffic, 10HTTPS, 10Patch-For-Review: Enforce HTTPS+HSTS on remaining one-off sites in wikimedia.org that don't use standard cache cluster termination - https://phabricator.wikimedia.org/T132521#3420541 (10Jgreen) [13:15:56] dcausse: it's there, please test and let me know if I can continue [13:16:03] testing [13:16:05] (03PS1) 10MarcoAurelio: Add wgMetaNamespace / wgMetaNamespaceTalk for lv.wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364197 (https://phabricator.wikimedia.org/T170065) [13:16:22] 10Operations, 10Traffic, 10HTTPS, 10Patch-For-Review: Enforce HTTPS+HSTS on remaining one-off sites in wikimedia.org that don't use standard cache cluster termination - https://phabricator.wikimedia.org/T132521#2201391 (10Jgreen) [13:16:40] 10Operations, 10monitoring, 10netops: Evaluate NetBox as a Racktables replacement & IPAM - https://phabricator.wikimedia.org/T170144#3420547 (10faidon) [13:16:53] 10Operations, 10Traffic, 10HTTPS, 10Patch-For-Review: Enforce HTTPS+HSTS on remaining one-off sites in wikimedia.org that don't use 
standard cache cluster termination - https://phabricator.wikimedia.org/T132521#2201391 (10Jgreen) [13:17:15] (03CR) 10Jcrespo: [C: 04-2] "Let's block this, probably will be abandoned." [puppet] - 10https://gerrit.wikimedia.org/r/363204 (https://phabricator.wikimedia.org/T169514) (owner: 10Jcrespo) [13:17:30] (03PS3) 10Jcrespo: mariadb: Test new multiinstance dbstore role on an empty host [puppet] - 10https://gerrit.wikimedia.org/r/364180 (https://phabricator.wikimedia.org/T169514) [13:17:39] 10Operations, 10Traffic: stream.wikimedia.org: remove legacy rcstream/socket.io HTTPS redirect hole punches - https://phabricator.wikimedia.org/T168919#3420564 (10Ottomata) Yup! I just got back from vacation, but baring some unknown blocker, plan to turn it off this week. (still checking emails though!) [13:17:56] zeljkof: works well on mwdebug1002 [13:17:58] 10Operations, 10Icinga, 10monitoring: Create "network" icinga group - https://phabricator.wikimedia.org/T167279#3420565 (10faidon) [13:18:08] Urbanecm: can you test 364131 at mwdebug? (once it is there) [13:18:20] dcausse: ok, pushing to the intertubes then [13:18:25] thanks! [13:19:03] zeljkof: I don't have import rights there so I can't. 
[13:19:24] (03PS2) 10Elukey: Use parallelism to sqoop large tables [puppet] - 10https://gerrit.wikimedia.org/r/363846 (https://phabricator.wikimedia.org/T169782) (owner: 10Milimetric) [13:19:40] (03CR) 10Jcrespo: [C: 032] mariadb: Test new multiinstance dbstore role on an empty host [puppet] - 10https://gerrit.wikimedia.org/r/364180 (https://phabricator.wikimedia.org/T169514) (owner: 10Jcrespo) [13:20:05] !log zfilipin@tin Synchronized wmf-config/CirrusSearch-common.php: SWAT: [[gerrit:364183|[cirrus] Enable the token_count_router only for chinese (T169498)]] (duration: 00m 43s) [13:20:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:15] T169498: Investigate load spikes on the elasticsearch cluster in eqiad - https://phabricator.wikimedia.org/T169498 [13:20:33] zeljkof: thanks! [13:20:37] dcausse: 364131 is deployed [13:20:46] Urbanecm: so, I should just push to prod? [13:21:14] Urbanecm: reviewing 364131 [13:21:21] 10Operations, 10Traffic, 10HTTPS, 10Patch-For-Review: Enforce HTTPS+HSTS on remaining one-off sites in wikimedia.org that don't use standard cache cluster termination - https://phabricator.wikimedia.org/T132521#3420572 (10Jgreen) [13:21:23] 10Operations, 10Traffic: Fix nits in HTTPS/HSTS configs in externally-hosted fundraising domains - https://phabricator.wikimedia.org/T137161#3420570 (10Jgreen) 05Open>03Resolved Closing this task because the remaining non-compliant site, benefactorevents.wikimedia.org, has been shut down. [13:21:29] zeljkof: Yes. 
[13:21:30] (03CR) 10Elukey: [C: 032] Use parallelism to sqoop large tables [puppet] - 10https://gerrit.wikimedia.org/r/363846 (https://phabricator.wikimedia.org/T169782) (owner: 10Milimetric) [13:21:37] (03PS3) 10Elukey: Use parallelism to sqoop large tables [puppet] - 10https://gerrit.wikimedia.org/r/363846 (https://phabricator.wikimedia.org/T169782) (owner: 10Milimetric) [13:21:39] (03CR) 10Elukey: [V: 032 C: 032] Use parallelism to sqoop large tables [puppet] - 10https://gerrit.wikimedia.org/r/363846 (https://phabricator.wikimedia.org/T169782) (owner: 10Milimetric) [13:22:13] Urbanecm: will do [13:22:21] Thank you [13:22:29] (03PS2) 10Zfilipin: Add import sources for specieswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364131 (https://phabricator.wikimedia.org/T170094) (owner: 10Urbanecm) [13:23:11] (03PS3) 10MarcoAurelio: Set $wgCategoryCollation to 'uca-default' for fr.wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363740 (https://phabricator.wikimedia.org/T169810) [13:23:21] PROBLEM - puppet last run on db1096 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:24:35] (03CR) 10Ottomata: [C: 031] role::analytics_cluster::hadoop::master: add icinga check for HDFS topology [puppet] - 10https://gerrit.wikimedia.org/r/363307 (https://phabricator.wikimedia.org/T163909) (owner: 10Elukey) [13:24:38] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364131 (https://phabricator.wikimedia.org/T170094) (owner: 10Urbanecm) [13:24:59] Urbanecm: merging the commit, can you test it at all? [13:25:09] I mean, once it is deployed to prod? [13:25:13] (03PS1) 10Jcrespo: mariadb: Fix dbstore3 my.cnf template filename (typo) [puppet] - 10https://gerrit.wikimedia.org/r/364199 (https://phabricator.wikimedia.org/T169514) [13:25:16] zeljkof: I don't have rights for it. 
[13:25:18] (03PS2) 10MarcoAurelio: Remove Programs and Participation namespaces from meta.wikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363745 (https://phabricator.wikimedia.org/T61837) [13:25:23] Urbanecm: this will be fun :) [13:25:34] I do [13:25:42] mind if I test for you? [13:25:43] (03PS3) 10Elukey: role::analytics_cluster::hadoop::master: add icinga check for HDFS topology [puppet] - 10https://gerrit.wikimedia.org/r/363307 (https://phabricator.wikimedia.org/T163909) [13:25:48] TabbyCat: That's great. Not at all [13:25:49] (03Merged) 10jenkins-bot: Add import sources for specieswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364131 (https://phabricator.wikimedia.org/T170094) (owner: 10Urbanecm) [13:25:52] TabbyCat: oh, great, please do, will ping you in a minute [13:25:58] :) [13:26:01] PROBLEM - IPsec on mc1026 is CRITICAL: Strongswan CRITICAL - ok: 0 not-conn: mc2026_v4 [13:26:14] (03CR) 10jenkins-bot: Add import sources for specieswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364131 (https://phabricator.wikimedia.org/T170094) (owner: 10Urbanecm) [13:26:30] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Overall looks ok, comments inline" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/364195 (https://phabricator.wikimedia.org/T164073) (owner: 10Elukey) [13:26:39] TabbyCat can you test your patches at mwdebug? [13:26:49] (once they are there) [13:26:57] zeljkof: sure [13:26:58] thanks akosiaris!! [13:27:06] 1001 or 1002? [13:27:37] 10Operations, 10Traffic, 10netops: Recurring varnish-be fetch failures in codfw - https://phabricator.wikimedia.org/T170131#3420580 (10ayounsi) Some of those spikes seem to match with OSPF flaps triggered by BFD. The link between cr2-codfw and cr2-eqiad stay up, no packet loss, but for some reasons BFD occas... 
[13:28:01] RECOVERY - IPsec on mc1026 is OK: Strongswan OK - 1 ESP OK [13:28:08] !log Disable puppet on db1102 to run check_private_data - T153743 [13:28:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:28:20] T153743: Add and sanitize s2, s4, s5, s6 and s7 to sanitarium2 and new labsdb hosts - https://phabricator.wikimedia.org/T153743 [13:29:23] TabbyCat: 364131 is at mwdebug, please test and let me know if I can continue [13:29:36] (03CR) 10Elukey: [C: 032] role::analytics_cluster::hadoop::master: add icinga check for HDFS topology [puppet] - 10https://gerrit.wikimedia.org/r/363307 (https://phabricator.wikimedia.org/T163909) (owner: 10Elukey) [13:29:38] zeljkof: which mwdebug? [13:29:39] zeljkof: which mwdebug? [13:29:42] lol [13:29:44] !log installing graphite2 security updates (image lib) [13:29:44] lol [13:29:46] xD [13:29:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:29:55] Sagan: copyright :P [13:30:05] TabbyCat: at my client, I was first :P [13:30:07] TabbyCat, Sagan: sorry, mwdebug1002 :) [13:30:12] testing [13:30:25] 10Operations, 10fundraising-tech-ops: set up SSL cert monitoring for benefactorevents.wm.o - https://phabricator.wikimedia.org/T156850#3420588 (10Jgreen) >>! In T156850#3419411, @Joe wrote: > This thing is alerting since 4 days as it's apparently using the default azure ssl cert. > > I am RADICALLY AGAINST mo... 
[13:31:02] zeljkof: looks good to me -- transwiki import feature now appears and sources as requested [13:31:17] TabbyCat: thanks, deploying [13:31:32] (03CR) 10Jcrespo: [C: 032] mariadb: Fix dbstore3 my.cnf template filename (typo) [puppet] - 10https://gerrit.wikimedia.org/r/364199 (https://phabricator.wikimedia.org/T169514) (owner: 10Jcrespo) [13:31:37] (03PS2) 10Jcrespo: mariadb: Fix dbstore3 my.cnf template filename (typo) [puppet] - 10https://gerrit.wikimedia.org/r/364199 (https://phabricator.wikimedia.org/T169514) [13:31:39] community can test later if import happens properly [13:31:39] 10Operations, 10DBA, 10MediaWiki-extensions-WikibaseClient, 10Performance-Team, and 6 others: Cache invalidations coming from the JobQueue are causing lag on several wikis - https://phabricator.wikimedia.org/T164173#3420592 (10Ladsgroup) a:05Ladsgroup>03daniel [13:32:41] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] [13:32:57] !log zfilipin@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:364131|Add import sources for specieswiki (T170094)]] (duration: 00m 43s) [13:33:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:33:07] T170094: Configuring importers on Wikispecies. 
- https://phabricator.wikimedia.org/T170094 [13:33:14] 10Operations, 10DBA, 10monitoring, 10Patch-For-Review, 10Prometheus-metrics-monitoring: MySQL monitoring with prometheus - https://phabricator.wikimedia.org/T143896#3420611 (10jcrespo) [13:34:09] Urbanecm, TabbyCat: 364131 is deployed, please test if it looks ok [13:34:17] testing again [13:34:41] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [1000.0] [13:34:54] TabbyCat: reviewing 363740 [13:34:55] (03CR) 10Ottomata: "Retroactive +1 :)" [puppet] - 10https://gerrit.wikimedia.org/r/361497 (https://phabricator.wikimedia.org/T167670) (owner: 10Ppchelko) [13:35:05] looks good [13:35:06] okay [13:35:21] PROBLEM - HHVM rendering on mw2217 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:35:37] (03PS3) 10Rush: Revert "phabricator: Block IP ranges for recent uploaded offtopic files" [puppet] - 10https://gerrit.wikimedia.org/r/363356 (owner: 10Aklapper) [13:35:59] (03PS4) 10MarcoAurelio: Set $wgCategoryCollation to 'uca-default' for fr.wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363740 (https://phabricator.wikimedia.org/T169810) [13:36:01] (03PS8) 10Alexandros Kosiaris: monitoring::host provide basic Rspec [puppet] - 10https://gerrit.wikimedia.org/r/363186 [13:36:11] RECOVERY - HHVM rendering on mw2217 is OK: HTTP OK: HTTP/1.1 200 OK - 78643 bytes in 0.362 second response time [13:36:52] (03CR) 10Alexandros Kosiaris: [C: 032] "I think I 've addresses Antoine's comments in PS7. 
I 'll merge, we can always improve the spec of course" [puppet] - 10https://gerrit.wikimedia.org/r/363186 (owner: 10Alexandros Kosiaris) [13:36:54] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] monitoring::host provide basic Rspec [puppet] - 10https://gerrit.wikimedia.org/r/363186 (owner: 10Alexandros Kosiaris) [13:37:20] 10Operations, 10monitoring, 10Patch-For-Review: Several hosts return "internal IPMI error" in the check_ipmi_temp check - https://phabricator.wikimedia.org/T167121#3420641 (10jcrespo) db2044 seems to have random phases of UNKNOWN state: https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=db20... [13:37:57] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363740 (https://phabricator.wikimedia.org/T169810) (owner: 10MarcoAurelio) [13:38:03] (03PS9) 10Andrew Bogott: Refactor puppetmaster roles into role/profiles [puppet] - 10https://gerrit.wikimedia.org/r/363870 [13:39:15] (03Merged) 10jenkins-bot: Set $wgCategoryCollation to 'uca-default' for fr.wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363740 (https://phabricator.wikimedia.org/T169810) (owner: 10MarcoAurelio) [13:39:17] (03CR) 10Rush: [C: 031] "I have gotten irc signoff from ema (re status of 363264), mukunda and andre so I'm rolling back these bans as they are affecting legit use" [puppet] - 10https://gerrit.wikimedia.org/r/363356 (owner: 10Aklapper) [13:39:24] (03CR) 10jenkins-bot: Set $wgCategoryCollation to 'uca-default' for fr.wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363740 (https://phabricator.wikimedia.org/T169810) (owner: 10MarcoAurelio) [13:39:26] (03PS4) 10Rush: Revert "phabricator: Block IP ranges for recent uploaded offtopic files" [puppet] - 10https://gerrit.wikimedia.org/r/363356 (owner: 10Aklapper) [13:41:30] (03CR) 10Rush: [C: 032] Revert "phabricator: Block IP ranges for recent uploaded offtopic files" [puppet] - 10https://gerrit.wikimedia.org/r/363356 (owner: 10Aklapper) 
[13:41:33] TabbyCat: 363740 is at mwdebug1002, please test [13:41:49] zeljkof: change is not really testable [13:42:08] TabbyCat: ok, in that case pushing to prod [13:42:17] it requires updateCollation.php to see the changes [13:42:18] k [13:42:25] What's the worst that could happen (TM) [13:42:59] TabbyCat: ok, so first the deploy, then the script, right? [13:43:00] !log elastic@eqiad banning elastic1018 & elastic1021 to rebalance heavy shards [13:43:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:26] zeljkof: yes, first deploy, check wiki and then script on terbium [13:44:19] TabbyCat: what needs to be checked? the wiki is up at all? :) [13:44:24] !log zfilipin@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:363740|Set $wgCategoryCollation to uca-default for fr.wiktionary (T169810)]] (duration: 00m 42s) [13:44:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:44:35] T169810: The French Wiktionary requires to set "$wgCategoryCollation" to "uca-default" - https://phabricator.wikimedia.org/T169810 [13:45:15] zeljkof: see if we broke something in production but the wiki is still up and running so you can run that script now or after swat if you want [13:45:55] TabbyCat: I'll run the script now, do you think it would take a long time? 
[13:46:13] not sure that I have understood what needs to be checked :/ [13:46:24] zeljkof: no idea, I suppose it depends on how many categories need to be updated [13:46:25] (03PS4) 10Giuseppe Lavagetto: Add coverage report [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/363350 [13:46:27] (03PS4) 10Giuseppe Lavagetto: Raise test coverage percentage [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/363351 [13:46:29] (03PS2) 10Giuseppe Lavagetto: Add future parser run mode [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/363808 (https://phabricator.wikimedia.org/T169546) [13:46:32] zeljkof: forget about the check [13:47:33] 11447021 rows [13:47:37] Yeah, the script will take quite a while [13:48:09] Reedy: so terbium or other server? [13:48:12] Reedy: should I run it now, or later? [13:48:29] Doesn't matter too much. But it needs running in a screen/similar. Probably on terbium, yes [13:49:01] zeljkof: note that the next two swat patches also need maintenance script running [13:49:04] !log labstore2003:~# umount -fl /srv/backup/tools (for T169774 recovery) [13:49:11] maybe we can let it run after we're done?
[13:49:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:49:16] T169774: 2017-07-02 Toolforge data loss for permissive data - https://phabricator.wikimedia.org/T169774 [13:49:36] (03PS1) 10Jcrespo: mariadb:Change instance template identifiers for them to be dynamic [puppet] - 10https://gerrit.wikimedia.org/r/364203 (https://phabricator.wikimedia.org/T169514) [13:50:01] (03PS3) 10MarcoAurelio: Remove Programs and Participation namespaces from meta.wikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363745 (https://phabricator.wikimedia.org/T61837) [13:50:07] (03PS2) 10Jcrespo: mariadb:Change instance template identifiers for them to be dynamic [puppet] - 10https://gerrit.wikimedia.org/r/364203 (https://phabricator.wikimedia.org/T169514) [13:50:43] TabbyCat, Reedy: ok, I'll need help with running the script, do we have any docs on running the scripts with screen or something like that? [13:50:47] this is all I know https://wikitech.wikimedia.org/wiki/SWAT_deploys/Deployers#Maintenance_Scripts [13:50:58] it would be great if you could update the docs [13:51:05] screen mwscript... [13:51:15] It's nothing WMF specific [13:51:31] Reedy: oh, ok, I thought it would be something more complicated [13:51:44] but it would still be great if you could add it to the docs... 
[13:52:01] not all of us speak bash natively :) [13:52:56] (03CR) 10Jcrespo: [C: 032] mariadb:Change instance template identifiers for them to be dynamic [puppet] - 10https://gerrit.wikimedia.org/r/364203 (https://phabricator.wikimedia.org/T169514) (owner: 10Jcrespo) [13:53:50] (03PS1) 10Elukey: Change Piwik's backup user/pass namespace to improve consistency [labs/private] - 10https://gerrit.wikimedia.org/r/364204 [13:53:53] (03PS7) 10Giuseppe Lavagetto: mediawiki: fixes for the future parser [WiP] [puppet] - 10https://gerrit.wikimedia.org/r/356539 [13:54:06] (03CR) 10jerkins-bot: [V: 04-1] mediawiki: fixes for the future parser [WiP] [puppet] - 10https://gerrit.wikimedia.org/r/356539 (owner: 10Giuseppe Lavagetto) [13:54:13] Reedy, TabbyCat: is this what I need to do? zfilipin@terbium:~$ screen mwscript Manual:UpdateCollation.php [13:54:27] hmm.. I'd say no [13:54:44] mwscript updateCollation.php --wiki=frwiktionary ? [13:55:00] Probably want to tell it the old collation too [13:55:09] (03PS2) 10Elukey: Change Piwik's backup user/pass namespace to improve consistency [labs/private] - 10https://gerrit.wikimedia.org/r/364204 [13:55:29] (03CR) 10Elukey: [V: 032 C: 032] Change Piwik's backup user/pass namespace to improve consistency [labs/private] - 10https://gerrit.wikimedia.org/r/364204 (owner: 10Elukey) [13:55:31] so this? zfilipin@terbium:~$ screen mwscript Manual:UpdateCollation.php --wiki=frwiktionary [13:56:09] Reedy: yes, it says it'll speed-up the running [13:56:34] zeljkof: Why there is the Manual: string? I think it should be removed before running [13:57:07] (03CR) 10Ottomata: [C: 031] Update hadoop fair scheduler queues [puppet] - 10https://gerrit.wikimedia.org/r/362151 (https://phabricator.wikimedia.org/T156841) (owner: 10Joal) [13:57:25] Urbanecm: ouch, sorry, copy/pasted from the url [13:57:47] mwscript updateCollation.php --wiki=frwiktionary --previous-collation=uppercase ?? 
[13:57:59] Niharika: ping [13:58:01] (03PS2) 10Elukey: role::piwik::server: add regular bacula backups [puppet] - 10https://gerrit.wikimedia.org/r/364195 (https://phabricator.wikimedia.org/T164073) [13:58:25] Reedy: does that look good? ^ [13:58:26] zeljkof: we can run it later if you wish [13:58:49] TabbyCat: I would like to at least know exactly what to do now :) [13:59:07] zeljkof: **I** would say [13:59:10] mwscript updateCollation.php --wiki=frwiktionary --previous-collation=uppercase ?? [13:59:11] and I would prefer to run it while people are around, I really have almost no experience with running scripts [13:59:15] w/o ?? [13:59:35] but looking at the code I don't see that the --wiki parameter is defined [13:59:39] 10Operations, 10DBA, 10MediaWiki-extensions-WikibaseClient, 10Performance-Team, and 6 others: Cache invalidations coming from the JobQueue are causing lag on several wikis - https://phabricator.wikimedia.org/T164173#3420723 (10daniel) @aaron another question: does RefreshLinksJob also purge the CDN cache a... [13:59:40] so I'm a bit lost as well [14:00:01] Dereckson: ? [14:00:10] TabbyCat: It is caught by mwscript [14:00:18] TabbyCat: can you run the script later? (when we figure out what to do) or should I do it? [14:00:23] (03PS8) 10Giuseppe Lavagetto: mediawiki: fixes for the future parser [WiP] [puppet] - 10https://gerrit.wikimedia.org/r/356539 [14:00:34] RECOVERY - nova instance creation test on labnet1001 is OK: PROCS OK: 1 process with command name python, args nova-fullstack [14:00:38] zeljkof: I do not have access to do that, sorry [14:00:41] Which is a shell wrapper for some PHP script but I don't remember the name. 
[14:00:44] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1001 is OK: OK: Less than 20.00% above the threshold [500.0] [14:01:14] !log elastic@eqiad unbanning elastic1018 & elastic1021 [14:01:15] I thought about requesting access but it strikes me that I have to disclose my name to some unknown people [14:01:20] (03PS1) 10Faidon Liambotis: icinga: merge routers/switches monitoring groups [puppet] - 10https://gerrit.wikimedia.org/r/364206 (https://phabricator.wikimedia.org/T167279) [14:01:22] (03PS1) 10Faidon Liambotis: icinga: move RIPE Atlas host monitoring under netops [puppet] - 10https://gerrit.wikimedia.org/r/364207 (https://phabricator.wikimedia.org/T167279) [14:01:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:23] so I prefer to keep asking around [14:01:24] (03PS1) 10Faidon Liambotis: icinga: move RIPE Atlas measurements under netops [puppet] - 10https://gerrit.wikimedia.org/r/364208 [14:01:45] (03CR) 10jerkins-bot: [V: 04-1] mediawiki: fixes for the future parser [WiP] [puppet] - 10https://gerrit.wikimedia.org/r/356539 (owner: 10Giuseppe Lavagetto) [14:01:47] ofc I could lie and say that I'm marco mcmarcoaurelioface [14:01:55] but I'm not that type of person [14:02:19] could anything go wrong if we do not run the script for a while? [14:02:32] zeljkof: nope [14:02:33] Categories will be wrongly organised [14:02:40] for a while [14:03:00] (03CR) 10jerkins-bot: [V: 04-1] icinga: move RIPE Atlas measurements under netops [puppet] - 10https://gerrit.wikimedia.org/r/364208 (owner: 10Faidon Liambotis) [14:03:34] Reedy: you look like you know the most about running the script, is this what I should do?
[14:03:34] PROBLEM - nova instance creation test on labnet1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args nova-fullstack [14:04:04] Reedy: zfilipin@terbium:~$ screen mwscript updateCollation.php --wiki=frwiktionary --previous-collation=uppercase [14:04:11] (03CR) 10Andrew Bogott: [C: 032] Toolforge: Update motd banners for rebranding [puppet] - 10https://gerrit.wikimedia.org/r/364007 (https://phabricator.wikimedia.org/T168480) (owner: 10BryanDavis) [14:04:20] (03PS2) 10Andrew Bogott: Toolforge: Update motd banners for rebranding [puppet] - 10https://gerrit.wikimedia.org/r/364007 (https://phabricator.wikimedia.org/T168480) (owner: 10BryanDavis) [14:04:45] (03PS1) 10Jcrespo: mariadb: Create /etc/mysql/mysqld.conf.d directory [puppet] - 10https://gerrit.wikimedia.org/r/364209 (https://phabricator.wikimedia.org/T169514) [14:04:48] TabbyCat: eu swat window is over, do you mind moving the remaining patches to another window? [14:05:08] especially since all of them require scripts to be ran [14:05:09] if we must... [14:05:12] (03PS2) 10Jcrespo: mariadb: Create /etc/mysql/mysqld.conf.d directory [puppet] - 10https://gerrit.wikimedia.org/r/364209 (https://phabricator.wikimedia.org/T169514) [14:05:18] TabbyCat: I would really prefer that [14:06:05] zeljkof: well, okay, but I think we should've done the remaining ones because namespaceDupes.php is not very time-consuming [14:06:07] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1091" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364210 [14:06:09] TabbyCat: could you also please find out exactly how the script needs to be ran, for the remaining patches? 
that would speed up the deploy [14:06:15] (03PS2) 10Marostegui: Revert "db-eqiad.php: Depool db1091" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364210 [14:06:33] mwscript namespaceDupes.php --wiki= --fix IIRC [14:06:35] (03PS2) 10Faidon Liambotis: icinga: move RIPE Atlas measurements under netops [puppet] - 10https://gerrit.wikimedia.org/r/364208 [14:06:59] but you should check with Reedy or Dereckson since they've run several of those for me [14:07:04] TabbyCat: could you please add that to gerrit comments [14:07:13] yes, sure [14:07:18] screen mwscript updateCollation.php --wiki=frwiktionary --previous-collation=uppercase [14:07:19] anything you need [14:07:21] That looks correct [14:07:22] Run it [14:07:24] TabbyCat: And --add-prefix=something maybe. [14:07:25] Detach from the screen [14:07:27] (03CR) 10Andrew Bogott: [C: 032] Refactor puppetmaster roles into role/profiles [puppet] - 10https://gerrit.wikimedia.org/r/363870 (owner: 10Andrew Bogott) [14:07:32] Come back and check on it in a few hours [14:07:38] (03PS10) 10Andrew Bogott: Refactor puppetmaster roles into role/profiles [puppet] - 10https://gerrit.wikimedia.org/r/363870 [14:07:40] Reedy: ok, will do [14:07:56] Urbanecm: if we want to leave them with the Participation: string do I need to --add-prefix? [14:08:01] I don't really know [14:08:35] (03CR) 10Jcrespo: [C: 032] mariadb: Create /etc/mysql/mysqld.conf.d directory [puppet] - 10https://gerrit.wikimedia.org/r/364209 (https://phabricator.wikimedia.org/T169514) (owner: 10Jcrespo) [14:08:37] Depends if there's any conflicts [14:08:43] Usually run it without --fix [14:08:46] See what's gonna happen [14:08:58] You may or may not need a prefix/suffix [14:10:01] and !_log mwscript updateCollation.php --wiki=frwiktionary --previous-collation=uppercase ? 
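The screen workflow described in the channel at 14:07 (run the script, detach, come back in a few hours) can be sketched roughly as below. This is only an illustration: the `mwscript` command line is the one confirmed in the channel, while the session name `updatecollation`, the helper `maintenance_cmd`, and the `DRY_RUN` switch are hypothetical names added for the example. By default the sketch only prints the command it would run.

```shell
#!/bin/sh
# Sketch: run a long maintenance script inside a named screen session.
# SESSION, DRY_RUN and maintenance_cmd are illustrative assumptions,
# not part of any Wikimedia tooling.

SESSION="updatecollation"
WIKI="frwiktionary"

maintenance_cmd() {
    # Builds the command quoted in the channel at 14:07:18,
    # without the leading "screen".
    printf 'mwscript updateCollation.php --wiki=%s --previous-collation=%s\n' "$1" "$2"
}

if [ "${DRY_RUN:-1}" = "1" ]; then
    # Default: only show what would be executed.
    echo "screen -S $SESSION $(maintenance_cmd "$WIKI" uppercase)"
else
    # -S gives the session a name so it is easy to find again.
    # The substitution is left unquoted on purpose so it splits into words.
    screen -S "$SESSION" $(maintenance_cmd "$WIKI" uppercase)
fi

# Detach without stopping the script: press Ctrl-a, then d.
# Later, list sessions with `screen -ls` and reattach with
# `screen -r updatecollation` to check on progress.
```

Naming the session is a convenience rather than a requirement; a plain `screen mwscript ...` as typed in the channel works the same way, but a named session is easier to pick out if several are open on the host.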
[14:10:02] (03PS11) 10Andrew Bogott: Refactor puppetmaster roles into role/profiles [puppet] - 10https://gerrit.wikimedia.org/r/363870 [14:10:17] Reedy: TabbyCat I am running the script, will leave a comment at the patch later [14:10:23] !eu swat finished [14:10:45] TabbyCat: please mark the remaining patches at not deployed and/or move them to another swat window [14:10:49] 10Operations, 10ops-eqiad, 10Cloud-Services, 10Cloud-VPS, and 2 others: rack/setup/install labvirt101[5-8] - https://phabricator.wikimedia.org/T165531#3420744 (10chasemp) a:05chasemp>03Andrew ```[edit interfaces interface-range labs-instance-ports] member ge-5/0/3 { ... } + member ge-2/0/21; +... [14:11:11] * andrewbogott never not unsubscribed [14:11:51] !log EU SWAT finished [14:11:57] 10Operations, 10monitoring: Evaluate Grafana's LDAP group options and deprecate grafana-admin if possible - https://phabricator.wikimedia.org/T170150#3420747 (10faidon) [14:12:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:12:10] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1091" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364210 (owner: 10Marostegui) [14:12:14] !log mwscript updateCollation.php --wiki=frwiktionary --previous-collation=uppercase is being running by zfilipin to finish T169810 [14:12:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:12:25] T169810: The French Wiktionary requires to set "$wgCategoryCollation" to "uca-default" - https://phabricator.wikimedia.org/T169810 [14:12:58] (03PS3) 10Elukey: role::piwik::server: add regular bacula backups [puppet] - 10https://gerrit.wikimedia.org/r/364195 (https://phabricator.wikimedia.org/T164073) [14:13:22] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1091" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364210 (owner: 10Marostegui) [14:13:32] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1091" [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/364210 (owner: 10Marostegui) [14:13:50] !log reimaging mw2118 (video scaler) to jessie [14:13:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:04] zeljkof: moved to Morning SWAT [14:14:17] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1091 - T168661 (duration: 00m 42s) [14:14:21] TabbyCat: thanks, sorry for being slow, but I did not want to break things [14:14:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:28] T168661: Apply schema change to add 3D filetype for STL files - https://phabricator.wikimedia.org/T168661 [14:14:45] zeljkof: no need to be sorry, I'd have done the same [14:14:50] better to be safe than sorry [14:15:11] (03CR) 10Alexandros Kosiaris: [C: 031] "I am wonder whether 14 is a good enough number but I guess we can re-evaluate down the road and amend if necessary. LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/364195 (https://phabricator.wikimedia.org/T164073) (owner: 10Elukey) [14:16:09] (03PS1) 10Marostegui: db-eqiad.php: Depool db1059 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364212 (https://phabricator.wikimedia.org/T168661) [14:16:13] (03PS3) 10Andrew Bogott: Toolforge: Update motd banners for rebranding [puppet] - 10https://gerrit.wikimedia.org/r/364007 (https://phabricator.wikimedia.org/T168480) (owner: 10BryanDavis) [14:16:24] PROBLEM - puppet last run on db1047 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): File[/etc/mysql/grcat.config] [14:17:39] 10Operations, 10Discovery-Analysis: Upgrade pandoc package to at least 1.12.3 - https://phabricator.wikimedia.org/T168683#3420767 (10Ottomata) [14:17:44] RECOVERY - puppet last run on db1096 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [14:17:59] 10Operations, 10Discovery-Analysis: Upgrade pandoc package to at least 1.12.3 - https://phabricator.wikimedia.org/T168683#3372065 (10Ottomata) Will look into this as part of or after T152712 [14:18:40] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1059 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364212 (https://phabricator.wikimedia.org/T168661) (owner: 10Marostegui) [14:19:52] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1059 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364212 (https://phabricator.wikimedia.org/T168661) (owner: 10Marostegui) [14:20:01] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1059 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364212 (https://phabricator.wikimedia.org/T168661) (owner: 10Marostegui) [14:20:43] !log marostegui@tin scap failed: average error rate on 1/11 canaries increased by 10x (rerun with --force to override this check, see https://logstash.wikimedia.org/goto/3888cca979647b9381a7739b0bdbc88e for details) [14:20:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:58] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1059 - T168661 (duration: 00m 41s) [14:23:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:10] T168661: Apply schema change to add 3D filetype for STL files - https://phabricator.wikimedia.org/T168661 [14:23:24] RECOVERY - puppet last run on db1047 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [14:23:34] !log Deploy alter table on db1059 - T168661 [14:23:45] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:24:12] zeljkof: how's the script going? :) [14:26:45] TabbyCat: didn't check yet, will do later [14:27:45] (03PS1) 10Andrew Bogott: Node defs for new labvirts: 1015-1018 [puppet] - 10https://gerrit.wikimedia.org/r/364213 (https://phabricator.wikimedia.org/T165531) [14:28:09] zeljkof: okay, we've just got questions from JackPotte because the categories there started to behave badly. This is because they're being reordered IMHO. In case of further issues, I hope you and Reedy can look further into it. I will too. [14:29:09] It's always gonna happen [14:29:18] (03PS2) 10MarcoAurelio: Add wgMetaNamespace / wgMetaNamespaceTalk for lv.wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364197 (https://phabricator.wikimedia.org/T170065) [14:29:36] Until we can support two (or more) collations simultaneously [14:30:18] no idea - I'm gonna try updating some logos and finish the tasks I've got assigned [14:30:56] Well, what do you expect? lol [14:31:04] MW is trying to sort based on a half-updated collation column [14:31:15] Some rows follow one rule, some follow another [14:31:34] But all of them are being sorted using the new collation [14:31:44] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] [14:34:49] Reedy: tell Jack.
I knew that'd happen heh [14:35:01] Just ignore him till it's finished [14:35:05] lol [14:35:12] will do [14:35:19] (03CR) 10Andrew Bogott: [C: 032] Node defs for new labvirts: 1015-1018 [puppet] - 10https://gerrit.wikimedia.org/r/364213 (https://phabricator.wikimedia.org/T165531) (owner: 10Andrew Bogott) [14:35:54] 10Operations, 10ops-codfw: mc2023 / mc2025 fail to mount root partition within 90 seconds using Linux 4.9 - https://phabricator.wikimedia.org/T170152#3420837 (10MoritzMuehlenhoff) [14:37:44] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1001 is OK: OK: Less than 20.00% above the threshold [500.0] [14:38:44] PROBLEM - puppet last run on labvirt1015 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:40:33] (03PS3) 10Elukey: redis::monitoring::nrpe_instance: set retry_interval to 2 mins [puppet] - 10https://gerrit.wikimedia.org/r/363791 [14:40:45] !log rebooting labvirt1015-1018 for kernel updates [14:40:51] volans: found the solution for the redis issue, --^ wasn't merged :D [14:40:55] my bad [14:40:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:14] elukey: rotfl [14:43:51] 10Operations, 10Beta-Cluster-Infrastructure, 10Performance-Team, 10Thumbor, 10Patch-For-Review: Beta thumbnails are broken - https://phabricator.wikimedia.org/T169114#3420891 (10Gilles) I've added the Beta poolcounter to deployment-imagescaler01's config in horizon. Seems to work fine! [14:48:54] RECOVERY - puppet last run on labvirt1015 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [14:52:06] 10Operations, 10Mail: set up DMARC aggregate report collection into a database for research and reporting - https://phabricator.wikimedia.org/T86209#3420966 (10herron) Scratch that... We need to move from dmarcian to hosting reports ourselves after all. A pair of virtual machines have been provisioned (T1695... 
[14:54:03] 10Operations, 10DBA, 10Mail: Setup database for dmarc service - https://phabricator.wikimedia.org/T170158#3420991 (10jcrespo) [14:56:32] (03CR) 10Alexandros Kosiaris: [C: 04-1] "First round of comments" (036 comments) [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/363217 (owner: 10Giuseppe Lavagetto) [14:57:21] TabbyCat: Apologies for the late reply. I see you got the right script command. :) [15:00:01] !log installing apache security updates on app server canaries [15:00:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:21] (03PS1) 10Ottomata: Disable RCStream [puppet] - 10https://gerrit.wikimedia.org/r/364219 (https://phabricator.wikimedia.org/T170157) [15:00:36] ottomata: \o/ \o/ \o/ [15:03:13] :) [15:06:19] (03PS2) 10Ottomata: Disable RCStream [puppet] - 10https://gerrit.wikimedia.org/r/364219 (https://phabricator.wikimedia.org/T170157) [15:08:27] (03PS1) 10Giuseppe Lavagetto: recommendation api: refactor profile, remove module [puppet] - 10https://gerrit.wikimedia.org/r/364221 (https://phabricator.wikimedia.org/T148129) [15:08:38] <_joe_> mobrovac: btw, ^^ [15:09:32] (03PS3) 10GWicke: PDF Render: Check hourly if the service is running via cron [puppet] - 10https://gerrit.wikimedia.org/r/359967 (https://phabricator.wikimedia.org/T159922) [15:10:17] (03CR) 10Mobrovac: [C: 031] recommendation api: refactor profile, remove module [puppet] - 10https://gerrit.wikimedia.org/r/364221 (https://phabricator.wikimedia.org/T148129) (owner: 10Giuseppe Lavagetto) [15:11:51] (03CR) 10GWicke: [C: 031] "We had one instance hang over the last two days, which eventually triggered alerts. 
I restarted the eqiad instances manually (which fixed " [puppet] - 10https://gerrit.wikimedia.org/r/359967 (https://phabricator.wikimedia.org/T159922) (owner: 10GWicke) [15:11:55] 10Operations, 10ops-eqiad, 10Cloud-Services, 10Patch-For-Review, 10cloud-services-team (Kanban): rack/setup/install labvirt101[5-8] - https://phabricator.wikimedia.org/T165531#3421069 (10Andrew) 05Open>03Resolved These are up and puppetized and running VMs. [15:12:45] (03PS3) 10Giuseppe Lavagetto: Add future parser run mode [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/363808 (https://phabricator.wikimedia.org/T169546) [15:13:39] PROBLEM - mediawiki-installation DSH group on mw2118 is CRITICAL: Host mw2118 is not in mediawiki-installation dsh group [15:14:27] 10Operations, 10monitoring: Remote IPMI doesn't work for ~2% of the fleet - https://phabricator.wikimedia.org/T150160#3421088 (10faidon) [15:14:46] !log Drop ukwikimedia_p views from labsdb hosts - T169488 [15:14:55] (03PS4) 10Alexandros Kosiaris: monitoring::host: Monitor IPMI as well if applicable [puppet] - 10https://gerrit.wikimedia.org/r/363295 (https://phabricator.wikimedia.org/T169321) [15:14:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:14:56] T169488: Drop ukwikimedia from labsdb hosts (was: ukwikimedia still present on replicas dbs on labs hosts) - https://phabricator.wikimedia.org/T169488 [15:15:09] (03CR) 10jerkins-bot: [V: 04-1] monitoring::host: Monitor IPMI as well if applicable [puppet] - 10https://gerrit.wikimedia.org/r/363295 (https://phabricator.wikimedia.org/T169321) (owner: 10Alexandros Kosiaris) [15:15:19] PROBLEM - Host mw2118 is DOWN: PING CRITICAL - Packet loss = 100% [15:16:29] RECOVERY - Host mw2118 is UP: PING OK - Packet loss = 0%, RTA = 36.11 ms [15:18:29] (03PS1) 10Andrew Bogott: Nova: Add labvirt1014 and 1015 to the scheduling pool. 
[puppet] - 10https://gerrit.wikimedia.org/r/364223 [15:18:36] (03PS4) 10Giuseppe Lavagetto: Add future parser run mode [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/363808 (https://phabricator.wikimedia.org/T169546) [15:18:40] PROBLEM - puppet last run on mw2118 is CRITICAL: CRITICAL: Puppet has 6 failures. Last run 6 minutes ago with 6 failures. Failed resources (up to 3 shown): File_line[login.defs-SYS_GID_MAX],File[/etc/firejail/mediawiki-converters.profile],Package[fonts-noto-cjk],Service[nutcracker] [15:19:39] PROBLEM - DPKG on mw2118 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [15:19:46] 10Operations, 10monitoring, 10Technical-Debt: Retire Torrus - https://phabricator.wikimedia.org/T87840#3421126 (10faidon) a:05akosiaris>03fgiunchedi [15:19:51] 10Operations, 10DC-Ops, 10monitoring, 10Patch-For-Review: Monitor all management interfaces - https://phabricator.wikimedia.org/T169321#3421128 (10akosiaris) a:03akosiaris [15:21:39] RECOVERY - DPKG on mw2118 is OK: All packages OK [15:22:30] (03CR) 10Rush: [C: 031] "suuuurrreeeeeeeeeeeeeeeeeeeeeee" [puppet] - 10https://gerrit.wikimedia.org/r/364223 (owner: 10Andrew Bogott) [15:22:39] RECOVERY - puppet last run on mw2118 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [15:23:17] (03CR) 10Andrew Bogott: [C: 032] Nova: Add labvirt1014 and 1015 to the scheduling pool. [puppet] - 10https://gerrit.wikimedia.org/r/364223 (owner: 10Andrew Bogott) [15:23:51] (03CR) 10BBlack: [C: 04-1] "Can we split this into two separate changes for safety? 
The first being just the cache_misc disable of the service in hieradata/role/comm" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/364219 (https://phabricator.wikimedia.org/T170157) (owner: 10Ottomata) [15:24:19] 10Operations, 10DBA, 10Mail: Setup database for dmarc service - https://phabricator.wikimedia.org/T170158#3421146 (10jcrespo) [15:24:28] 10Operations, 10DBA, 10Mail: Setup database for dmarc service - https://phabricator.wikimedia.org/T170158#3420991 (10jcrespo) [15:24:53] !log adding two new hosts (labvirt1014 and labvirt1015) to the nova-compute scheduling pool. Possible nodepool side-effects, maybe good ones? [15:25:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:25:40] (03CR) 10Ottomata: "Can do..." [puppet] - 10https://gerrit.wikimedia.org/r/364219 (https://phabricator.wikimedia.org/T170157) (owner: 10Ottomata) [15:27:23] 10Operations, 10DC-Ops, 10monitoring, 10Patch-For-Review: Monitor all management interfaces - https://phabricator.wikimedia.org/T169321#3421181 (10akosiaris) [15:27:25] 10Operations, 10Icinga, 10monitoring: Monitor all mgmt hosts - https://phabricator.wikimedia.org/T85143#3421183 (10akosiaris) [15:27:59] 10Operations, 10monitoring: Evaluate Grafana's LDAP group options and deprecate grafana-admin if possible - https://phabricator.wikimedia.org/T170150#3421184 (10akosiaris) a:03akosiaris [15:29:00] (03PS1) 10MarcoAurelio: Logo updates for sr.wikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364226 (https://phabricator.wikimedia.org/T168444) [15:30:02] zeljkof: How's the updateCollation script for T169810 going? Did it finish? 
[15:30:03] T169810: The French Wiktionary requires to set "$wgCategoryCollation" to "uca-default" - https://phabricator.wikimedia.org/T169810 [15:30:05] 10Operations, 10ops-codfw, 10monitoring, 10Patch-For-Review: rack/setup/install netmon2001 - https://phabricator.wikimedia.org/T166180#3421194 (10Papaul) [15:30:41] Urbanecm: will check right now [15:30:48] Thank you [15:33:50] Urbanecm: sorry, meeting started, will try later [15:34:12] zeljkof: Ok. [15:35:28] !log milimetric@tin Started deploy [analytics/refinery@6da2774]: Update Sqoop fix python error [15:35:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:46] 10Operations, 10Icinga, 10monitoring: remove icinga monitoring for benefactorevents.wm.o SSL certificate - https://phabricator.wikimedia.org/T170139#3420428 (10Dzahn) This will be reverting T156850. [15:35:50] (03CR) 10Herron: [C: 032] Make mailman filter messages with high X-Spam-Score by default [puppet] - 10https://gerrit.wikimedia.org/r/350429 (https://phabricator.wikimedia.org/T161082) (owner: 10Nemo bis) [15:37:34] !log milimetric@tin Finished deploy [analytics/refinery@6da2774]: Update Sqoop fix python error (duration: 02m 06s) [15:37:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:37:53] (03CR) 10MarcoAurelio: "Forgot to set the paths for the new HD logos. I'll amend this shortly." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364226 (https://phabricator.wikimedia.org/T168444) (owner: 10MarcoAurelio) [15:38:19] PROBLEM - Disk space on stat1002 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=88%) [15:38:29] (03CR) 10Urbanecm: [C: 04-1] "HD logos aren't in InitialiseSettings.php. Please add them there." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/364226 (https://phabricator.wikimedia.org/T168444) (owner: 10MarcoAurelio) [15:38:45] wtf is happening today everybody conflicting with me [15:39:15] (03PS4) 10Herron: Make mailman filter messages with high X-Spam-Score by default [puppet] - 10https://gerrit.wikimedia.org/r/350429 (https://phabricator.wikimedia.org/T161082) (owner: 10Nemo bis) [15:39:19] RECOVERY - Disk space on stat1002 is OK: DISK OK [15:41:02] scap + git-fat --^ [15:41:06] removed revs manually [15:42:16] !log milimetric@tin Started deploy [analytics/refinery@6da2774]: Update Sqoop fix python error [15:42:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:42:44] 10Operations, 10Traffic, 10Patch-For-Review: Decrease max object TTL in varnishes - https://phabricator.wikimedia.org/T124954#3421257 (10BBlack) Recap of recent progress: where we're at now is a hard cap of 1 day TTL within each cache layer, regardless of any longer max-age sent by the application layer. De... 
[15:43:04] (03PS2) 10MarcoAurelio: Logo updates for sr.wikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364226 (https://phabricator.wikimedia.org/T168444) [15:43:52] !log milimetric@tin Finished deploy [analytics/refinery@6da2774]: Update Sqoop fix python error (duration: 01m 36s) [15:44:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:44:08] (03CR) 10Urbanecm: [C: 031] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364226 (https://phabricator.wikimedia.org/T168444) (owner: 10MarcoAurelio) [15:46:23] (03PS1) 10Dzahn: icinga: remove monitoring of benefactorevents.wm.org [puppet] - 10https://gerrit.wikimedia.org/r/364231 (https://phabricator.wikimedia.org/T156850) [15:46:23] PROBLEM - Apache HTTP on mw2143 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:47:09] !log milimetric@tin Started deploy [analytics/refinery@6da2774]: Update Sqoop fix python error [15:47:13] RECOVERY - Apache HTTP on mw2143 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.119 second response time [15:47:16] !log milimetric@tin Finished deploy [analytics/refinery@6da2774]: Update Sqoop fix python error (duration: 00m 07s) [15:47:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:47:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:48:48] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Hm, we have a kind of a de facto standard to have a module per service we are running currently. Granted this creates many modules that ar" [puppet] - 10https://gerrit.wikimedia.org/r/364221 (https://phabricator.wikimedia.org/T148129) (owner: 10Giuseppe Lavagetto) [15:51:06] 10Operations, 10Traffic: Content purges are unreliable - https://phabricator.wikimedia.org/T133821#3421301 (10BBlack) I've updated over in T124954#3421257 on the TTL reduction work that has happened in recent months. 
The TL;DR there is that even in the case that no purging works at all and the application-spe... [15:55:57] !log elukey@tin Started deploy [analytics/refinery@6da2774]: Update stat1002 with the last refinery deployment [15:56:01] !log elukey@tin Finished deploy [analytics/refinery@6da2774]: Update stat1002 with the last refinery deployment (duration: 00m 04s) [15:56:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:57:30] 10Operations, 10Traffic, 10HTTPS, 10Patch-For-Review: Enforce HTTPS+HSTS on remaining one-off sites in wikimedia.org that don't use standard cache cluster termination - https://phabricator.wikimedia.org/T132521#3421337 (10BBlack) [15:57:33] 10Operations, 10Traffic: Fix nits in HTTPS/HSTS configs in externally-hosted fundraising domains - https://phabricator.wikimedia.org/T137161#3421335 (10BBlack) 05Resolved>03Open All of these hostnames are still in DNS AFAICS: ``` benefactors.wikimedia.org benefactorevents.wikimedia.org eventdonations.wiki... [15:57:48] 10Operations, 10monitoring: Create instrumentation to monitor load on geoiplookup.wikimedia.org - https://phabricator.wikimedia.org/T104258#3421339 (10faidon) 05Open>03Resolved a:03faidon Long resolved, geoiplookup doesn't exist anymore (T100902). [16:04:51] 10Operations, 10ops-codfw, 10fundraising-tech-ops, 10netops: codfw: rack frack refresh equipment - https://phabricator.wikimedia.org/T169643#3421371 (10Papaul) Racking and cabling in progress [16:05:39] 10Operations, 10fundraising-tech-ops, 10Patch-For-Review: set up SSL cert monitoring for benefactorevents.wm.o - https://phabricator.wikimedia.org/T156850#3421372 (10Dzahn) >>! In T156850#3420588, @Jgreen wrote: Trilogy shut down the site on Friday (fr-tech/fr-tech-ops didn't get advanced notice) and the ic... 
[16:10:27] Urbanecm: the script is still running, just checked [16:10:57] zeljkof: ok [16:11:25] 10Operations, 10Traffic: revoke benefactorevents.wikimedia.org SSL certificate - https://phabricator.wikimedia.org/T170140#3421416 (10Jgreen) [16:11:27] 10Operations, 10Traffic: Fix nits in HTTPS/HSTS configs in externally-hosted fundraising domains - https://phabricator.wikimedia.org/T137161#3421415 (10Jgreen) [16:14:33] !log nuria@tin Started deploy [eventlogging/analytics@5e16da1]: (no justification provided) [16:14:37] !log nuria@tin Finished deploy [eventlogging/analytics@5e16da1]: (no justification provided) (duration: 00m 04s) [16:14:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:14:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:21:26] (03CR) 1020after4: [C: 031] deployment-prep: enable reusable TC on HHVM [puppet] - 10https://gerrit.wikimedia.org/r/364148 (https://phabricator.wikimedia.org/T103886) (owner: 10Giuseppe Lavagetto) [16:22:52] (03PS1) 10MarcoAurelio: Logo and favicon changes for arbcom_dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364237 (https://phabricator.wikimedia.org/T166947) [16:23:13] 10Operations, 10Deployment-Systems, 10Performance-Team, 10HHVM, and 2 others: Translation cache exhaustion caused by changes to PHP code in file scope - https://phabricator.wikimedia.org/T103886#3421453 (10greg) [16:23:58] 10Operations, 10Icinga, 10monitoring, 10Patch-For-Review: Icinga randomly forgets downtimes, causing alert and page spam - https://phabricator.wikimedia.org/T164206#3421458 (10akosiaris) Seems like this happened again on Fri Jul 7 - db1102 mysqld process - icinga lost downtime. I 'll search the debug logs... [16:26:23] 10Operations, 10Traffic: Fix nits in HTTPS/HSTS configs in externally-hosted fundraising domains - https://phabricator.wikimedia.org/T137161#3421463 (10Jgreen) >>! 
In T137161#3421335, @BBlack wrote: > All of these hostnames are still in DNS AFAICS: > > ``` > benefactors.wikimedia.org > benefactorevents.wikime... [16:27:59] 10Operations, 10ops-eqiad, 10Traffic, 10netops: Upgrade BIOS/RBSU/etc on lvs1007 - https://phabricator.wikimedia.org/T167299#3421466 (10BBlack) Update? [16:30:24] (03PS2) 10MarcoAurelio: Logo and favicon changes for arbcom_dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364237 (https://phabricator.wikimedia.org/T166947) [16:31:35] (03CR) 10MarcoAurelio: "All logos were optimized with optiPNG -o8." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364237 (https://phabricator.wikimedia.org/T166947) (owner: 10MarcoAurelio) [16:31:52] (03PS1) 10Marostegui: db-eqiad.php: Repool db1080 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364240 (https://phabricator.wikimedia.org/T166204) [16:35:23] (03CR) 10MarcoAurelio: "Note to deployer according to " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364237 (https://phabricator.wikimedia.org/T166947) (owner: 10MarcoAurelio) [16:35:28] (03PS1) 10Jdlrobson: Logo changes for various wiki projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364241 (https://phabricator.wikimedia.org/T165896) [16:38:01] 10Operations, 10Goal, 10Kubernetes, 10User-Joe: Implement a pod networking policy approach - https://phabricator.wikimedia.org/T170111#3421503 (10Joe) [16:39:01] (03CR) 10MarcoAurelio: "Note to deployers as per " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364226 (https://phabricator.wikimedia.org/T168444) (owner: 10MarcoAurelio) [16:40:13] (03PS3) 10MarcoAurelio: Logo updates for sr.wikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364226 (https://phabricator.wikimedia.org/T168444) [16:46:47] (03CR) 10Urbanecm: [C: 031] "@MarcoAurelio: I don't think it is really needed, it just works. It can be just synced." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/364226 (https://phabricator.wikimedia.org/T168444) (owner: 10MarcoAurelio) [16:49:05] (03PS16) 10Paladox: Gerrit: Add support for scap [puppet] - 10https://gerrit.wikimedia.org/r/363726 (https://phabricator.wikimedia.org/T157414) [16:49:55] [16:57] Niharika TabbyCat: Apologies for the late reply. I see you got the right script command. <-- sorry for late replying, and thanks -- wanna review some easy patches I've got stalled in the queue or you too busy? [16:54:36] (03PS17) 10Paladox: Gerrit: Add support for scap [puppet] - 10https://gerrit.wikimedia.org/r/363726 (https://phabricator.wikimedia.org/T157414) [16:54:38] (03PS6) 10WMDE-leszek: mediawiki: Remove broken wikidata.org/ontology Apache alias [puppet] - 10https://gerrit.wikimedia.org/r/361801 (https://phabricator.wikimedia.org/T169023) (owner: 10Krinkle) [16:55:08] 10Operations, 10fundraising-tech-ops, 10Patch-For-Review: set up SSL cert monitoring for benefactorevents.wm.o - https://phabricator.wikimedia.org/T156850#3421576 (10Jgreen) >>! In T156850#3421372, @Dzahn wrote: >>>! In T156850#3420588, @Jgreen wrote: >>Trilogy shut down the site on Friday (fr-tech/fr-tech-o... [16:57:25] TabbyCat: Sure thing, add me as a reviewer. 
[16:58:34] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Repool db1080 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364240 (https://phabricator.wikimedia.org/T166204) (owner: 10Marostegui) [16:59:33] (03Merged) 10jenkins-bot: db-eqiad.php: Repool db1080 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364240 (https://phabricator.wikimedia.org/T166204) (owner: 10Marostegui) [16:59:42] (03CR) 10jenkins-bot: db-eqiad.php: Repool db1080 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364240 (https://phabricator.wikimedia.org/T166204) (owner: 10Marostegui) [17:00:04] gehel: Respected human, time to deploy Weekly Wikidata query service deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170710T1700). Please do the needful. [17:00:05] 10Operations, 10Performance, 10User-Elukey: Investigate a simplified replication model for the Redis Job Queues - https://phabricator.wikimedia.org/T164738#3421590 (10elukey) 05Open>03declined All these hosts will probably be deprecated once eventbus+changeprop will take over. [17:00:44] jouncebot: o/ [17:00:46] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1080 - T166204 (duration: 00m 42s) [17:00:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:00:59] T166204: Convert unique keys into primary keys for some wiki tables on s1 - https://phabricator.wikimedia.org/T166204 [17:01:45] 10Operations, 10Patch-For-Review, 10User-Elukey, 10Wikimedia-log-errors: Warning: timed out after 0.2 seconds when connecting to rdb1001.eqiad.wmnet [110]: Connection timed out - https://phabricator.wikimedia.org/T125735#3421595 (10elukey) @aaron any chance in your opinion that we could use nutcracker on t... 
[17:01:57] (03PS18) 10Paladox: Gerrit: Add support for scap [puppet] - 10https://gerrit.wikimedia.org/r/363726 (https://phabricator.wikimedia.org/T157414) [17:06:10] !log gehel@tin Started deploy [wdqs/wdqs@1b3b73e]: (no justification provided) [17:06:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:07:06] (03CR) 10Greg Grossmeier: "\o/" [puppet] - 10https://gerrit.wikimedia.org/r/364148 (https://phabricator.wikimedia.org/T103886) (owner: 10Giuseppe Lavagetto) [17:07:52] !log gehel@tin Finished deploy [wdqs/wdqs@1b3b73e]: (no justification provided) (duration: 01m 42s) [17:08:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:08:15] SMalyshev: deployment completed, tests are green [17:08:58] 10Operations, 10ops-codfw, 10fundraising-tech-ops, 10netops: codfw: rack frack refresh equipment - https://phabricator.wikimedia.org/T169643#3421623 (10ayounsi) About naming, the "issue" is that the devices are stacked, so logically seen as a single device. It happen when two stacked/clustered devices are... 
[17:09:02] (03PS1) 10Marostegui: db1079.yaml: Specify ROW as binlog format [puppet] - 10https://gerrit.wikimedia.org/r/364247 (https://phabricator.wikimedia.org/T153743) [17:13:06] (03CR) 10Marostegui: "Puppet looks fine: https://puppet-compiler.wmflabs.org/6997/" [puppet] - 10https://gerrit.wikimedia.org/r/364247 (https://phabricator.wikimedia.org/T153743) (owner: 10Marostegui) [17:20:01] (03PS1) 10Ottomata: Remove rcstream routes from varnish [puppet] - 10https://gerrit.wikimedia.org/r/364252 (https://phabricator.wikimedia.org/T170157) [17:20:05] 10Operations, 10Discovery, 10Wikidata, 10Wikidata-Query-Service, and 3 others: Create response time monitoring for WDQS endpoint - https://phabricator.wikimedia.org/T119915#3421698 (10Smalyshev) Implemented as `geof:globe`, `geof:latitude` & `geof:longitude` [17:20:22] 10Operations, 10Discovery, 10Wikidata, 10Wikidata-Query-Service, and 3 others: Create response time monitoring for WDQS endpoint - https://phabricator.wikimedia.org/T119915#3421704 (10Smalyshev) 05Open>03Resolved [17:22:09] 10Operations, 10Discovery, 10Wikidata, 10Wikidata-Query-Service, and 3 others: Create response time monitoring for WDQS endpoint - https://phabricator.wikimedia.org/T119915#3421729 (10Gehel) The UNKNOWN disappeared now that we are active/active. Previously, when no traffic was sent to codfw, we had no mean... [17:31:27] (03PS2) 10Ottomata: Remove rcstream routes from varnish [puppet] - 10https://gerrit.wikimedia.org/r/364252 (https://phabricator.wikimedia.org/T170157) [17:36:29] 10Operations, 10Traffic, 10Community-Liaisons (Jul-Sep 2017): Communicate dropping IE8-on-XP support (a security change) to affected editors and other community members - https://phabricator.wikimedia.org/T163251#3421794 (10BBlack) Sorry, we've been wanting to make forward progress on this for several months... 
[17:39:00] (03CR) 10BBlack: [C: 031] Remove rcstream routes from varnish [puppet] - 10https://gerrit.wikimedia.org/r/364252 (https://phabricator.wikimedia.org/T170157) (owner: 10Ottomata) [17:39:15] ottomata: ^ [17:40:10] danke [17:40:22] i'm adding some routes to eventstreams to say something useful [17:40:25] at / and /rc [17:41:23] what was the script that updates the block rows and removes the expired ones? [17:41:34] 10Operations, 10Continuous-Integration-Infrastructure, 10Patch-For-Review, 10Release-Engineering-Team (Kanban): CI for operations/puppet is taking too long - https://phabricator.wikimedia.org/T166888#3421841 (10greg) RelEng plans to work on the Docker Jenkins job until completion in the short term. This in... [17:45:02] (03PS2) 10Dzahn: admins: remove expiry_date for Justin Clark [puppet] - 10https://gerrit.wikimedia.org/r/363884 [17:45:46] (03Abandoned) 10Dzahn: admins: remove expiry_date for Justin Clark [puppet] - 10https://gerrit.wikimedia.org/r/363884 (owner: 10Dzahn) [17:50:51] (03PS1) 10Papaul: DNS: Add mgmt and production DNS for netmon2001 [dns] - 10https://gerrit.wikimedia.org/r/364260 [17:51:16] (03CR) 10jerkins-bot: [V: 04-1] DNS: Add mgmt and production DNS for netmon2001 [dns] - 10https://gerrit.wikimedia.org/r/364260 (owner: 10Papaul) [17:52:27] 10Operations, 10ops-codfw, 10monitoring, 10Patch-For-Review: rack/setup/install netmon2001 - https://phabricator.wikimedia.org/T166180#3421872 (10Papaul) [17:52:56] Niharika: done, thanks :) [17:52:57] !log otto@tin Started deploy [eventstreams/deploy@3d37f5d]: Redirect routes for RCStream deprecation [17:53:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:54:09] (03CR) 10Dzahn: [C: 04-1] "the section you are editing is only for ".m." 
mobile links, it's just a style thing but please don't mix" [dns] - 10https://gerrit.wikimedia.org/r/361295 (https://phabricator.wikimedia.org/T168782) (owner: 10Urbanecm) [17:55:36] !log disabling RCStream varnish routing: T170157 [17:55:38] !log otto@tin Finished deploy [eventstreams/deploy@3d37f5d]: Redirect routes for RCStream deprecation (duration: 02m 41s) [17:55:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:55:47] T170157: Disable RCStream - https://phabricator.wikimedia.org/T170157 [17:55:47] (03CR) 10Ottomata: [C: 032] Remove rcstream routes from varnish [puppet] - 10https://gerrit.wikimedia.org/r/364252 (https://phabricator.wikimedia.org/T170157) (owner: 10Ottomata) [17:55:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:56:14] (03PS3) 10Dzahn: Add mai.wikimedia.org (Maithili user group) [dns] - 10https://gerrit.wikimedia.org/r/361295 (https://phabricator.wikimedia.org/T168782) (owner: 10Urbanecm) [17:57:56] (03PS4) 10Dzahn: Add mai.wikimedia.org (Maithili user group) [dns] - 10https://gerrit.wikimedia.org/r/361295 (https://phabricator.wikimedia.org/T168782) (owner: 10Urbanecm) [18:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Dear anthropoid, the time has come. Please deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170710T1800). [18:00:04] TabbyCat, Smalyshev, and Jdlrobson: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [18:00:05] (03CR) 10Dzahn: [C: 032] Add mai.wikimedia.org (Maithili user group) [dns] - 10https://gerrit.wikimedia.org/r/361295 (https://phabricator.wikimedia.org/T168782) (owner: 10Urbanecm) [18:00:16] *here* [18:00:17] here [18:00:18] Hello. I can SWAT today. 
[18:00:22] yay [18:01:11] (03PS4) 10Niharika29: Remove Programs and Participation namespaces from meta.wikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363745 (https://phabricator.wikimedia.org/T61837) (owner: 10MarcoAurelio) [18:01:12] (03CR) 10Niharika29: [C: 032] Remove Programs and Participation namespaces from meta.wikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363745 (https://phabricator.wikimedia.org/T61837) (owner: 10MarcoAurelio) [18:01:15] oh Niharika I've got two more patches to add [18:01:26] will add them at the bottom [18:02:18] TabbyCat: Go ahead. [18:03:01] (03CR) 10Dzahn: DNS: Add mgmt and production DNS for netmon2001 (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/364260 (owner: 10Papaul) [18:03:52] (03PS3) 10Niharika29: Add wgMetaNamespace / wgMetaNamespaceTalk for lv.wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364197 (https://phabricator.wikimedia.org/T170065) (owner: 10MarcoAurelio) [18:03:54] (03CR) 10Niharika29: [C: 032] Add wgMetaNamespace / wgMetaNamespaceTalk for lv.wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364197 (https://phabricator.wikimedia.org/T170065) (owner: 10MarcoAurelio) [18:04:36] (03PS2) 10Dzahn: DNS: Add mgmt and production DNS for netmon2001 [dns] - 10https://gerrit.wikimedia.org/r/364260 (owner: 10Papaul) [18:05:04] okay I'm back here [18:05:43] here [18:07:13] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL: CRITICAL: 1.69% of data above the critical threshold [1000.0] [18:07:45] * Niharika kicks Zuul [18:08:18] * TabbyCat joins in kicking zuul -- "high prio" queue is a mockery [18:08:35] (03PS3) 10Ottomata: Decom RCStream [puppet] - 10https://gerrit.wikimedia.org/r/364219 (https://phabricator.wikimedia.org/T170157) [18:08:58] Zuul is a troll, didn't you know? 
[18:09:15] leading to hell [18:09:18] xD [18:09:26] :P [18:09:45] operations-mw-config-composer-hhvm-jessie does not start [18:09:54] SMalyshev: Your patch has a -1 from Jenkins. Is that unrelated? [18:10:13] Aside, what's the gate-snad-submit-swat queue for? That's empty. [18:10:28] (03PS3) 10Dzahn: DNS: Add mgmt and production DNS for netmon2001 [dns] - 10https://gerrit.wikimedia.org/r/364260 (owner: 10Papaul) [18:11:18] Niharika: it looks like a phan problem... original patch is fine and phan complaint is in the code that wasn't even changed [18:11:23] 10Operations, 10ops-codfw, 10ops-eqiad, 10monitoring: Unresponsive/misconfigured iDRACs - https://phabricator.wikimedia.org/T169360#3421970 (10faidon) [18:11:25] and seems to be wrong at that [18:11:25] 10Operations, 10monitoring: Remote IPMI doesn't work for ~2% of the fleet - https://phabricator.wikimedia.org/T150160#3421969 (10faidon) [18:11:28] Niharika: gate-and-submit-swat = patches against wmf/* branches that were +2ed [18:11:34] (03Merged) 10jenkins-bot: Remove Programs and Participation namespaces from meta.wikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363745 (https://phabricator.wikimedia.org/T61837) (owner: 10MarcoAurelio) [18:11:36] (03PS1) 10Ottomata: Stop writing mediawiki.revision-create events to EventLogging analytics MySQL [puppet] - 10https://gerrit.wikimedia.org/r/364262 (https://phabricator.wikimedia.org/T169898) [18:11:42] * TabbyCat claps [18:11:42] There's an outstanding phab task for including mw-config in that chain too [18:11:46] (03CR) 10jenkins-bot: Remove Programs and Participation namespaces from meta.wikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363745 (https://phabricator.wikimedia.org/T61837) (owner: 10MarcoAurelio) [18:11:52] s/chain/queue [18:11:58] 10Operations, 10monitoring: Remote IPMI doesn't work for ~2% of the fleet - https://phabricator.wikimedia.org/T150160#2775695 (10faidon) [18:12:00] 10Operations, 10ops-codfw: troubleshoot drac on 
ms-be2010.codfw.wmnet - https://phabricator.wikimedia.org/T155690#3421977 (10faidon) 05Open>03declined ms-be2010 is decom'ed now, resolving. [18:12:01] It's prioritized above the regular gate-and-submit queue [18:12:02] RoanKattouw: Ahh, okay. Would be nice to have that, yeah. [18:12:12] (03CR) 10Ottomata: "Routes were removed in https://gerrit.wikimedia.org/r/#/c/364252/, I'll leave the RCStream service running for a few days before fulling d" [puppet] - 10https://gerrit.wikimedia.org/r/364219 (https://phabricator.wikimedia.org/T170157) (owner: 10Ottomata) [18:12:59] Niharika: looks like new phan was deployed and it has some false positives... [18:13:38] TabbyCat: Remove Programs and Participation namespaces from meta.wikimedia [mediawiki-config] - https://gerrit.wikimedia.org/r/363745 (https://phabricator.wikimedia.org/T61837 is up on mwdebug1002. [18:13:41] (03CR) 10Dzahn: [C: 032] DNS: Add mgmt and production DNS for netmon2001 [dns] - 10https://gerrit.wikimedia.org/r/364260 (owner: 10Papaul) [18:13:45] If there's anything to test. [18:14:06] Niharika: don't think so, please deploy and run namespaceDupes [18:14:17] SMalyshev: Yeah, I've been having Phan issues in another extension too. MaxSem pushed a patch to fix it. [18:14:18] please keep them with the prefixes [18:14:40] linkie? 
[18:15:31] Niharika: looks like phan has trouble with use statements now :( but this is unrelated to the patch [18:15:56] !log niharika29@tin Started scap: wmf-config/InitialiseSettings.php Remove Programs and Participation namespaces from meta.wikimedia [mediawiki-config] - https://gerrit.wikimedia.org/r/363745 (T61837) [18:16:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:16:07] T61837: Remove Participation: and Programs: namespaces from metawiki - https://phabricator.wikimedia.org/T61837 [18:18:34] (03CR) 10Ottomata: [C: 032] Stop writing mediawiki.revision-create events to EventLogging analytics MySQL [puppet] - 10https://gerrit.wikimedia.org/r/364262 (https://phabricator.wikimedia.org/T169898) (owner: 10Ottomata) [18:20:36] (03PS1) 10Jdlrobson: Stop disabling MFTidyMobileViewSections [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364264 (https://phabricator.wikimedia.org/T168671) [18:21:06] Niharika: ok, I think I figure it out. phan is not the fault here, it's another patch that is missing in wmf.7 - https://gerrit.wikimedia.org/r/#/c/363816/ - that's what phan complains about [18:22:16] !log niharika29@tin scap aborted: wmf-config/InitialiseSettings.php Remove Programs and Participation namespaces from meta.wikimedia [mediawiki-config] - https://gerrit.wikimedia.org/r/363745 (T61837) (duration: 06m 19s) [18:22:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:22:27] T61837: Remove Participation: and Programs: namespaces from metawiki - https://phabricator.wikimedia.org/T61837 [18:22:31] 10Operations, 10fundraising-tech-ops, 10Patch-For-Review: set up SSL cert monitoring for benefactorevents.wm.o - https://phabricator.wikimedia.org/T156850#3422035 (10Dzahn) >>! In T156850#3421576, @Jgreen wrote: > I just assumed I didn't get notified because fr-tech-ops wasn't included in the notification co... 
[18:23:19] !log niharika29@tin Synchronized wmf-config/InitialiseSettings.php: Remove Programs and Participation namespaces from meta.wikimedia [mediawiki-config] - https://gerrit.wikimedia.org/r/363745 (T61837) (duration: 00m 42s) [18:23:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:24:37] (03PS4) 10Niharika29: Add wgMetaNamespace / wgMetaNamespaceTalk for lv.wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364197 (https://phabricator.wikimedia.org/T170065) (owner: 10MarcoAurelio) [18:25:10] (03PS1) 10Catrope: Enable experimental RCFilters live update feature in beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364266 (https://phabricator.wikimedia.org/T167743) [18:25:41] (03PS4) 10Niharika29: Logo updates for sr.wikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364226 (https://phabricator.wikimedia.org/T168444) (owner: 10MarcoAurelio) [18:25:46] (03CR) 10Niharika29: [C: 032] Logo updates for sr.wikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364226 (https://phabricator.wikimedia.org/T168444) (owner: 10MarcoAurelio) [18:26:30] (03CR) 10Niharika29: Add wgMetaNamespace / wgMetaNamespaceTalk for lv.wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364197 (https://phabricator.wikimedia.org/T170065) (owner: 10MarcoAurelio) [18:26:32] (03CR) 10Niharika29: [C: 032] Add wgMetaNamespace / wgMetaNamespaceTalk for lv.wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364197 (https://phabricator.wikimedia.org/T170065) (owner: 10MarcoAurelio) [18:27:10] 10Operations, 10Discovery, 10Maps, 10Traffic, and 2 others: What is a reasonable per-IP ratelimit for maps - https://phabricator.wikimedia.org/T169175#3422063 (10BBlack) +1 on using a similar rate to the APIs on text. I wonder what the peak (ab?)users' rates on `upload.wikimedia.org` look like as well, an... 
[18:27:13] PROBLEM - eventlogging_sync processes on db1047 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /bin/bash /usr/local/bin/eventlogging_sync.sh [18:27:15] (03Merged) 10jenkins-bot: Logo updates for sr.wikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364226 (https://phabricator.wikimedia.org/T168444) (owner: 10MarcoAurelio) [18:27:24] (03CR) 10jenkins-bot: Logo updates for sr.wikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364226 (https://phabricator.wikimedia.org/T168444) (owner: 10MarcoAurelio) [18:28:32] TabbyCat: Can you test the logo changes? [18:28:36] 10Operations, 10Traffic: Fix nits in HTTPS/HSTS configs in externally-hosted fundraising domains - https://phabricator.wikimedia.org/T137161#3422073 (10BBlack) benefactors - It wasn't originally part of the original task here, we've just been questioning whether it's also being removed at the same time, since... [18:28:46] (03CR) 10MarcoAurelio: "Please CR+2 again, thanks." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364197 (https://phabricator.wikimedia.org/T170065) (owner: 10MarcoAurelio) [18:28:55] Niharika: sure [18:29:18] TabbyCat: The sr.wikiquote one is up. [18:30:20] Niharika: trying to see it on mwdebug1002 but idk if it is because of server caché that I cannot see the change [18:30:46] Niharika: now I do see it [18:30:50] and looks good [18:31:14] TabbyCat: Ack. 
[18:31:33] Niharika: 364197 needs cr+2 again [18:32:25] (03PS5) 10Niharika29: Add wgMetaNamespace / wgMetaNamespaceTalk for lv.wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364197 (https://phabricator.wikimedia.org/T170065) (owner: 10MarcoAurelio) [18:32:34] (03CR) 10Niharika29: Add wgMetaNamespace / wgMetaNamespaceTalk for lv.wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364197 (https://phabricator.wikimedia.org/T170065) (owner: 10MarcoAurelio) [18:32:37] (03CR) 10Niharika29: [C: 032] Add wgMetaNamespace / wgMetaNamespaceTalk for lv.wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364197 (https://phabricator.wikimedia.org/T170065) (owner: 10MarcoAurelio) [18:32:48] !log niharika29@tin Synchronized static/images/project-logos/srwikiquote-1.5x.png: Logo updates for sr.wikiquote (T168444) (duration: 00m 41s) [18:32:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:32:59] T168444: Logo for sr.wikiquote.org - https://phabricator.wikimedia.org/T168444 [18:33:32] (03Merged) 10jenkins-bot: Add wgMetaNamespace / wgMetaNamespaceTalk for lv.wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364197 (https://phabricator.wikimedia.org/T170065) (owner: 10MarcoAurelio) [18:33:36] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: Reset db1070 idrac - https://phabricator.wikimedia.org/T160392#3097215 (10faidon) FYI, db1071 is in a similar state, I'm not sure why. 
[18:33:41] (03CR) 10jenkins-bot: Add wgMetaNamespace / wgMetaNamespaceTalk for lv.wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364197 (https://phabricator.wikimedia.org/T170065) (owner: 10MarcoAurelio) [18:34:01] !log niharika29@tin Synchronized static/images/project-logos: Logo updates for sr.wikiquote (T168444) (duration: 00m 40s) [18:34:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:34:24] * MaxSem tried running cleanupTitles.php [18:34:25] page 3025511 (EPO/FSCONS_2011) doesn't match self. [18:34:25] DRY RUN: would rename 3025511 (204,'EPO/FSCONS_2011') to (0,'EPO/FSCONS_2011') [18:35:26] !log niharika29@tin scap failed: average error rate on 1/11 canaries increased by 10x (rerun with --force to override this check, see https://logstash.wikimedia.org/goto/3888cca979647b9381a7739b0bdbc88e for details) [18:35:36] ... [18:35:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:35:42] 10Operations, 10Discovery, 10Maps, 10Traffic, and 2 others: What is a reasonable per-IP ratelimit for maps - https://phabricator.wikimedia.org/T169175#3422107 (10Pnorman) > I'm not sure I agree with this conclusion? A Pokemon Go fansite using our tiles making 349 req/sec would not be rate-limited? If we se... [18:35:45] Always happens to me. [18:36:11] !log niharika29@tin Synchronized wmf-config/InitialiseSettings.php: Logo updates for sr.wikiquote (T168444) (duration: 00m 40s) [18:36:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:36:33] (03CR) 10Nuria: "+1, thanks" [puppet] - 10https://gerrit.wikimedia.org/r/364262 (https://phabricator.wikimedia.org/T169898) (owner: 10Ottomata) [18:37:19] TabbyCat: The srwikiquote logo is synced. [18:37:30] Niharika: thanks [18:38:10] TabbyCat: https://gerrit.wikimedia.org/r/#/c/364197/ is on mwdebug1002. [18:38:22] checking [18:38:51] Niharika: looks good to me [18:39:20] Syncing... 
[18:40:22] Niharika: so if we apply https://gerrit.wikimedia.org/r/#/c/364265/ that should fix phan complaints [18:40:27] !log niharika29@tin scap failed: average error rate on 1/11 canaries increased by 10x (rerun with --force to override this check, see https://logstash.wikimedia.org/goto/3888cca979647b9381a7739b0bdbc88e for details) [18:40:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:41:00] SMalyshev: You wanna add this to the SWAT? [18:41:06] Add it to the calendar. [18:41:46] !log niharika29@tin scap failed: average error rate on 1/11 canaries increased by 10x (rerun with --force to override this check, see https://logstash.wikimedia.org/goto/3888cca979647b9381a7739b0bdbc88e for details) [18:41:50] Ugh. [18:41:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:42:07] Why is scap failing so much today? [18:42:15] Fatalmonitor looks happy enough. [18:42:24] Niharika: ok, a sec [18:42:35] I am not missing another swat window [18:43:12] !log niharika29@tin scap failed: average error rate on 1/11 canaries increased by 10x (rerun with --force to override this check, see https://logstash.wikimedia.org/goto/3888cca979647b9381a7739b0bdbc88e for details) [18:43:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:43:37] Niharika: done [18:44:13] RECOVERY - eventlogging_sync processes on db1047 is OK: PROCS OK: 1 process with UID = 0 (root), args /bin/bash /usr/local/bin/eventlogging_sync.sh [18:44:42] Come on scap. One more try. 
[18:45:20] !log niharika29@tin scap failed: average error rate on 1/11 canaries increased by 10x (rerun with --force to override this check, see https://logstash.wikimedia.org/goto/3888cca979647b9381a7739b0bdbc88e for details) [18:45:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:45:49] c'mon :( [18:45:53] 10Operations: Look into feasibility of disabling sha-1 host keys on our ssh daemons - https://phabricator.wikimedia.org/T167966#3422178 (10ayounsi) Not sure if I'm hijacking the topic, but at least it's being tracked somewhere :) I've been playing with https://github.com/mozilla/ssh_scan on both servers and netw... [18:47:00] !log niharika29@tin Synchronized wmf-config/InitialiseSettings.php: Add wgMetaNamespace / wgMetaNamespaceTalk for lv.wiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/364197 (T170065) (duration: 00m 20s) [18:47:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:47:11] T170065: Rename Project and Project_talk namespaces on Latvian Wiktionary - https://phabricator.wikimedia.org/T170065 [18:47:15] TabbyCat: That one is now synced. [18:47:22] Had to do a force sync. :( [18:47:49] The most prominent error message on fatalmonitor is "LuaSandboxFunction::call(): recursion detected in /srv/mediawiki/php-1.30.0-wmf.7/extensions/Scribunto/engines/LuaSandbox/Engine.php on l [18:47:49] ine 312" [18:48:30] what has to do Scribunto with us? [18:49:02] hrm, maybe that check should only explode if more than 1 canary explodes? But how many...dunno. [18:49:07] Niharika: wanna me to fill a task and you continue with the SWAT? [18:49:48] TabbyCat: Yeah, I'm continuing with SWAT but that's probably the error message that's messing up with the canary traffic. File a bug if you can. :) [18:50:16] which host does it say is failing in all the scap dump? 
[18:50:24] sure, let's get this window complete, there are other fellow contributors waiting as well :) [18:50:33] we can run the maintenance scripts later as well [18:50:41] thcipriani: Different host every time. [18:51:01] TabbyCat: Yeah, can run the maintenance scripts afterwards. [18:51:24] (03PS3) 10Niharika29: Logo and favicon changes for arbcom_dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364237 (https://phabricator.wikimedia.org/T166947) (owner: 10MarcoAurelio) [18:51:28] (03CR) 10Niharika29: [C: 032] Logo and favicon changes for arbcom_dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364237 (https://phabricator.wikimedia.org/T166947) (owner: 10MarcoAurelio) [18:52:26] (03Merged) 10jenkins-bot: Logo and favicon changes for arbcom_dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364237 (https://phabricator.wikimedia.org/T166947) (owner: 10MarcoAurelio) [18:52:38] (03CR) 10jenkins-bot: Logo and favicon changes for arbcom_dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364237 (https://phabricator.wikimedia.org/T166947) (owner: 10MarcoAurelio) [18:53:44] TabbyCat: https://gerrit.wikimedia.org/r/#/c/364237/3 is on mwdebug1002 [18:53:57] Niharika: it's a private wiki so I cannot check much [18:54:04] logo in main page lgtm [18:54:36] favicon works too [18:54:44] so lgtm [18:54:54] a.k.a. looks good to me [18:55:43] TabbyCat: Syncing. 
[18:56:03] !log niharika29@tin scap failed: average error rate on 2/11 canaries increased by 10x (rerun with --force to override this check, see https://logstash.wikimedia.org/goto/3888cca979647b9381a7739b0bdbc88e for details) [18:56:09] Niharika: if you get a chance, could you copy-paste the scap output from ^ [18:56:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:56:37] https://www.irccloud.com/pastebin/3LSspppL/scapsucks [18:56:41] thcipriani: ^ [18:56:53] Also 18:56:03 Check 'Logstash Error rate for mw1261.eqiad.wmnet' failed: ERROR: 78% OVER_THRESHOLD (Avg. Error rate: Before: 0.13, After: 6.00, Threshold: 1.28) [18:57:05] (03PS1) 10Andrew Bogott: Rough in new labs puppetmaster roles [puppet] - 10https://gerrit.wikimedia.org/r/364267 [18:57:15] that's what I wanted [18:57:26] !log niharika29@tin Synchronized static/: Logo and favicon changes for arbcom_dewiki (T166947) (duration: 00m 20s) [18:57:31] I'm gonna go over the deployment window. Hopefully nobody minds. [18:57:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:57:37] T166947: Change logo and favicon for arbcom-de wiki - https://phabricator.wikimedia.org/T166947 [18:57:40] Niharika: https://gerrit.wikimedia.org/r/#/c/363996/ is fine now [18:58:02] (03CR) 10jerkins-bot: [V: 04-1] Rough in new labs puppetmaster roles [puppet] - 10https://gerrit.wikimedia.org/r/364267 (owner: 10Andrew Bogott) [18:58:35] !log niharika29@tin Synchronized wmf-config/InitialiseSettings.php: Logo and favicon changes for arbcom_dewiki (T166947) (duration: 00m 20s) [18:58:37] TabbyCat: All synced. [18:58:41] SMalyshev: Thank you! [18:58:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:58:46] hrm, I'm going to add an ignore in scap for that luasandbox extension thing. There's a task for it, but unfortunately there's not a lot we can do about it. 
Sorry for the noise Niharika [18:59:06] *not a lot we can do about it during deployments [19:00:24] thcipriani: Okay. I'll just have to force everything then. [19:02:07] Niharika: please don't get in that habit. I know it's annoying, but the check has saved me at least once from causing an outage. I'll have a patch to ignore this specific error message shortly. [19:02:38] (03PS2) 10Niharika29: Logo changes for various wiki projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364241 (https://phabricator.wikimedia.org/T165896) (owner: 10Jdlrobson) [19:02:46] (03CR) 10Niharika29: [C: 032] Logo changes for various wiki projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364241 (https://phabricator.wikimedia.org/T165896) (owner: 10Jdlrobson) [19:02:55] (03PS2) 10Niharika29: Stop disabling MFTidyMobileViewSections [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364264 (https://phabricator.wikimedia.org/T168671) (owner: 10Jdlrobson) [19:02:59] (03CR) 10Niharika29: [C: 032] Stop disabling MFTidyMobileViewSections [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364264 (https://phabricator.wikimedia.org/T168671) (owner: 10Jdlrobson) [19:03:18] thcipriani: I won't. :) [19:03:27] :) [19:03:38] (03Merged) 10jenkins-bot: Logo changes for various wiki projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364241 (https://phabricator.wikimedia.org/T165896) (owner: 10Jdlrobson) [19:03:53] (03Merged) 10jenkins-bot: Stop disabling MFTidyMobileViewSections [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364264 (https://phabricator.wikimedia.org/T168671) (owner: 10Jdlrobson) [19:04:46] jdlrobson: Both your changes are on mwdebug1002. 
[19:05:23] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL: CRITICAL: 1.69% of data above the critical threshold [1000.0] [19:06:07] (03CR) 10jenkins-bot: Logo changes for various wiki projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364241 (https://phabricator.wikimedia.org/T165896) (owner: 10Jdlrobson) [19:06:23] RECOVERY - carbon-cache too many creates on graphite1001 is OK: OK: Less than 1.00% above the threshold [500.0] [19:07:21] !log niharika29@tin scap failed: average error rate on 2/11 canaries increased by 10x (rerun with --force to override this check, see https://logstash.wikimedia.org/goto/3888cca979647b9381a7739b0bdbc88e for details) [19:07:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:07:41] thcipriani: 19:07:21 Check 'Logstash Error rate for mw1264.eqiad.wmnet' failed: ERROR: 64% OVER_THRESHOLD (Avg. Error rate: Before: 0.14, After: 4.00, Threshold: 1.43) [19:07:46] Niharika: thcipriani - https://phabricator.wikimedia.org/T170186 [19:07:53] !log niharika29@tin Synchronized static/images/mobile/: Logo changes for various wiki projects [mediawiki-config] - https://gerrit.wikimedia.org/r/364241 (https://phabricator.wikimedia.org/T165896) (duration: 00m 20s) [19:08:02] Thanks, TabbyCat. [19:08:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:08:09] T170186 [19:08:10] T170186: Repeat SCAP errors due to some Extension:Scribunto error - https://phabricator.wikimedia.org/T170186 [19:08:32] Niharika: https://gerrit.wikimedia.org/r/#/c/363996/ I need on terbium to test [19:09:37] SMalyshev: Yay, it merged finally. 
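Annotation on the two canary check failures pasted above (mw1261: Before 0.13, After 6.00, Threshold 1.28, 78% OVER_THRESHOLD; mw1264: Before 0.14, After 4.00, Threshold 1.43, 64% OVER_THRESHOLD). The percentage appears to be the share of the post-deploy error rate that sits above the threshold, truncated to a whole percent, and the threshold is roughly ten times the "Before" rate, in line with the "increased by 10x" wording. This is an inference from those two log lines only, not scap's actual logstash_checker code; the function name below is hypothetical.

```python
import math


def over_threshold_percent(after: float, threshold: float) -> int:
    """Hypothetical reconstruction, inferred from two logged failures only.

    Returns the share of the post-deploy error rate that lies above the
    threshold, truncated to a whole percent. Returns 0 when the rate is
    at or below the threshold.
    """
    if after <= threshold:
        return 0
    return math.floor((after - threshold) / after * 100)


# Values copied from the two failures in the log above.
print(over_threshold_percent(6.00, 1.28))  # mw1261.eqiad.wmnet
print(over_threshold_percent(4.00, 1.43))  # mw1264.eqiad.wmnet
```

With the logged values this gives 78 and 64, matching the pasted output. Rounding to nearest would give 79 for mw1261, which is why truncation is assumed here.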
[19:09:39] !log niharika29@tin Synchronized wmf-config/: Stop disabling MFTidyMobileViewSections (T168671) and Logo changes for various wiki projects (T165896) (duration: 00m 21s) [19:09:47] Niharika: yep :) [19:09:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:09:53] T168671: Update usage of deprecated wgUseTidy - https://phabricator.wikimedia.org/T168671 [19:09:53] T165896: Logo for sr.m.wikipedia.org - https://phabricator.wikimedia.org/T165896 [19:10:04] jdlrobson: Both synced. [19:10:41] SMalyshev: It's on terbium now. [19:10:52] Niharika: thanks, checking [19:11:42] (03PS1) 10Thcipriani: scap: logstash_checker ignore for T166348 [puppet] - 10https://gerrit.wikimedia.org/r/364268 (https://phabricator.wikimedia.org/T170186) [19:11:53] (03PS1) 10Jdlrobson: Compress srlogo for Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364269 (https://phabricator.wikimedia.org/T165896) [19:11:54] Niharika: seems to work fine, thanks! [19:12:02] (03PS1) 10Andrew Bogott: Move labs puppet enc into a profile [puppet] - 10https://gerrit.wikimedia.org/r/364270 [19:12:17] SMalyshev: Syncing then. [19:14:10] !log niharika29@tin Synchronized php-1.30.0-wmf.7/extensions/CirrusSearch/: Ignore archive records with null page_id (T169977) (duration: 00m 52s) [19:14:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:14:20] T169977: Bulk update failures: "id is missing" - https://phabricator.wikimedia.org/T169977 [19:14:25] Yay, this time I didn't have to force it. [19:14:33] (03CR) 10Bmansurov: [C: 031] Compress srlogo for Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364269 (https://phabricator.wikimedia.org/T165896) (owner: 10Jdlrobson) [19:15:12] Scap over for now except for a couple of maintenance scripts. 
[19:16:23] PROBLEM - Nginx local proxy to apache on mw2128 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:17:13] RECOVERY - Nginx local proxy to apache on mw2128 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.205 second response time [19:17:46] (03CR) 10Andrew Bogott: [C: 032] scap: logstash_checker ignore for T166348 [puppet] - 10https://gerrit.wikimedia.org/r/364268 (https://phabricator.wikimedia.org/T170186) (owner: 10Thcipriani) [19:18:04] andrewbogott: thanks! [19:18:52] (03CR) 10Andrew Bogott: [C: 032] Move labs puppet enc into a profile [puppet] - 10https://gerrit.wikimedia.org/r/364270 (owner: 10Andrew Bogott) [19:18:59] (03PS2) 10Andrew Bogott: Move labs puppet enc into a profile [puppet] - 10https://gerrit.wikimedia.org/r/364270 [19:19:24] TabbyCat: Did all the url purges for https://gerrit.wikimedia.org/r/#/c/364237/ and https://gerrit.wikimedia.org/r/#/c/364226/ [19:20:27] let me check [19:20:54] Niharika: yep [19:22:52] Niharika: https://gerrit.wikimedia.org/r/#/c/364268/ should temporarily resolve the rate of false positives you were seeing [19:23:20] thcipriani: \o/ [19:23:28] sorry for the extra noise :( [19:23:40] No worries. [19:24:32] (03CR) 10Dzahn: [C: 032] "note: this doesn't only remove the cert check, it removes the entire host as well" [puppet] - 10https://gerrit.wikimedia.org/r/364231 (https://phabricator.wikimedia.org/T156850) (owner: 10Dzahn) [19:26:45] Niharika: how's namespaceDupes going? [19:28:04] TabbyCat: From discussion offline and in another channel https://gerrit.wikimedia.org/r/#/c/363745/ does not need namespaceDupes to be run. It needs cleanupTitles, which MaxSem has volunteered to run on my behalf since I have a meeting right now. [19:28:32] oh Niharika thanks [19:28:40] how was the output MaxSem ? [19:28:43] any errors? 
[19:29:19] 10Operations, 10Discovery, 10Wikidata, 10Wikidata-Query-Service, and 3 others: Create response time monitoring for WDQS endpoint - https://phabricator.wikimedia.org/T119915#1839912 (10Izno) >>! In T119915#3421698, @Smalyshev wrote: > Implemented as `geof:globe`, `geof:latitude` & `geof:longitude` @smalysh... [19:29:22] https://gerrit.wikimedia.org/r/#/c/364197/ does need it I think [19:29:26] TabbyCat: He hasn't run it yet. He'll run it soon. [19:29:35] okay [19:29:41] Dereckson: ping [19:29:47] TabbyCat: Will run it for the other one. [19:29:58] thanks both :) [19:32:24] (03PS2) 10Andrew Bogott: Rough in new labs puppetmaster roles [puppet] - 10https://gerrit.wikimedia.org/r/364267 [19:33:17] (03CR) 10jerkins-bot: [V: 04-1] Rough in new labs puppetmaster roles [puppet] - 10https://gerrit.wikimedia.org/r/364267 (owner: 10Andrew Bogott) [19:35:50] 10Operations, 10Goal, 10Kubernetes: Upgrade to kubernetes >=1.5 - https://phabricator.wikimedia.org/T170119#3419847 (10yuvipanda) Note that 1.7 landed https://kubernetes.io/docs/admin/extensible-admission-controllers/ which will allow us to remove all of our custom patches used in tools. [19:41:54] (03PS3) 10Andrew Bogott: Rough in new labs puppetmaster roles [puppet] - 10https://gerrit.wikimedia.org/r/364267 [19:42:00] 10Operations, 10fundraising-tech-ops: remove eventdonations.wikimedia.org CNAME - https://phabricator.wikimedia.org/T170192#3422435 (10Jgreen) [19:42:05] TabbyCat: There were 9 pages I had to force rename with a prefix. You want me to share the list here or is it private? [19:42:16] Niharika: which wiki? [19:42:23] TabbyCat: Arbcom. [19:42:38] hmm arbcom... why arbcom? 
[19:42:48] (03CR) 10jerkins-bot: [V: 04-1] Rough in new labs puppetmaster roles [puppet] - 10https://gerrit.wikimedia.org/r/364267 (owner: 10Andrew Bogott) [19:42:54] TabbyCat: Oh sorry https://gerrit.wikimedia.org/r/#/c/364197/ [19:42:59] Lvwiktionary [19:43:07] I think we didn't had to run anything in arbcom Niharika -- if it is arbcom then it is private [19:43:22] TabbyCat: Yeah, I confused the two patches. Lvwiktionary. [19:43:23] if meta or lvwikt then sure, please add them to a Phab paste [19:43:32] so people can clear them [19:44:02] TabbyCat: Okay, I will add a comment to the ticket. [19:44:08] Niharika: thanks [19:44:17] meta still pending [19:44:44] TabbyCat: https://phabricator.wikimedia.org/T170065#3422453 [19:44:46] :) [19:47:59] got to go, please report the results for meta on the ticket and I'll look later [19:48:02] thanks! [19:48:17] !log Running cleanupTitles.php on Meta [19:48:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:49:01] (03PS4) 10Andrew Bogott: Rough in new labs puppetmaster roles [puppet] - 10https://gerrit.wikimedia.org/r/364267 [19:49:47] 10Operations, 10fundraising-tech-ops: revoke eventdonations.wikimedia.org SSL cert if there is one... - https://phabricator.wikimedia.org/T170193#3422468 (10Jgreen) [19:52:03] MaxSem: What was the script command you ran? [19:52:47] open screen session then mwscript cleanupTitles.php --wiki metawiki | tee somefile.log [19:53:06] then I'll grep through that file for the list of pages fixed [19:54:10] MaxSem: Ah. Nice. [19:54:35] well, and it blew up quickly [19:54:55] Better you than me. Blew up how though? [19:55:04] page 2428634 (Evaluations/WWHM) doesn't match self. [19:55:04] renaming 2428634 (209,'Evaluations/WWHM') to (0,'Evaluations/WWHM') [19:55:04] [28fd9d495307f57b3d639cb9] [no req] Wikimedia\Rdbms\DBQueryError from line 1145 of /srv/mediawiki-staging/php-1.30.0-wmf.7/includes/libs/rdbms/database/Database.php: A database query error has occurred. 
Did you forget to run your application's database schema updater after upgrading? [19:55:04] Query: UPDATE `page` SET page_namespace = '0',page_title = 'Evaluations/WWHM' WHERE page_id = '2428634' [19:55:18] pfft [19:56:49] Error: 1062 Duplicate entry '0-Evaluations/WWHM' for key 'name_title' [20:00:04] gwicke, cscott, arlolra, subbu, bearND, halfak, and Amir1: Respected human, time to deploy Services – Parsoid / OCG / Citoid / Mobileapps / ORES / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170710T2000). Please do the needful. [20:01:44] 10Operations, 10Performance-Team, 10Thumbor: Implement poolcounter failover in Thumbor - https://phabricator.wikimedia.org/T169312#3422531 (10Gilles) [20:01:47] 10Operations, 10Performance-Team, 10Thumbor, 10Patch-For-Review: Implement poolcounter failover in Thumbor - https://phabricator.wikimedia.org/T169312#3422532 (10Gilles) [20:02:36] 10Operations, 10Performance-Team, 10Thumbor: Investigate poolcounter failure leading to thumbor failing to generate thumbs - https://phabricator.wikimedia.org/T169313#3422547 (10Gilles) 05Open>03Resolved It seems to be that the issue simply came from the poolcounter being a valid IP but unreachable/the p... [20:02:39] 10Operations, 10Performance-Team, 10Thumbor, 10Patch-For-Review: Implement PoolCounter support in Thumbor - https://phabricator.wikimedia.org/T151066#3422549 (10Gilles) [20:06:23] PROBLEM - puppet last run on es1014 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [20:09:08] (03PS1) 10Ottomata: Update otto's .iterm2_shell_integration.bash [puppet] - 10https://gerrit.wikimedia.org/r/364273 [20:10:06] (03CR) 10Ottomata: [C: 032] Update otto's .iterm2_shell_integration.bash [puppet] - 10https://gerrit.wikimedia.org/r/364273 (owner: 10Ottomata) [20:13:26] (03CR) 10Zfilipin: "The script has finished:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/363740 (https://phabricator.wikimedia.org/T169810) (owner: 10MarcoAurelio) [20:16:07] (03PS3) 10D3r1ck01: Remove 'din' from wmgExtraLanguageNames [mediawiki-config] - 10https://gerrit.wikimedia.org/r/362876 (https://phabricator.wikimedia.org/T168523) [20:28:56] (03CR) 10Dzahn: [C: 04-1] Gerrit: Add support for scap (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/363726 (https://phabricator.wikimedia.org/T157414) (owner: 10Paladox) [20:31:00] 10Operations, 10Domains, 10Traffic, 10fundraising-tech-ops: revoke eventdonations.wikimedia.org SSL cert if there is one... 
- https://phabricator.wikimedia.org/T170193#3422630 (10Peachey88) [20:31:11] 10Operations, 10Domains, 10Traffic, 10fundraising-tech-ops: remove eventdonations.wikimedia.org CNAME - https://phabricator.wikimedia.org/T170192#3422631 (10Peachey88) [20:31:52] (03PS19) 10Paladox: Gerrit: Add support for scap [puppet] - 10https://gerrit.wikimedia.org/r/363726 (https://phabricator.wikimedia.org/T157414) [20:32:13] (03CR) 10Paladox: Gerrit: Add support for scap (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/363726 (https://phabricator.wikimedia.org/T157414) (owner: 10Paladox) [20:34:21] (03CR) 10Jgreen: [C: 031] icinga: remove monitoring of benefactorevents.wm.org [puppet] - 10https://gerrit.wikimedia.org/r/364231 (https://phabricator.wikimedia.org/T156850) (owner: 10Dzahn) [20:35:33] RECOVERY - puppet last run on es1014 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [20:36:39] (03PS2) 10Dzahn: icinga: remove monitoring of benefactorevents.wm.org [puppet] - 10https://gerrit.wikimedia.org/r/364231 (https://phabricator.wikimedia.org/T156850) [20:37:54] (03PS3) 10Dzahn: icinga: remove monitoring of benefactorevents.wm.org [puppet] - 10https://gerrit.wikimedia.org/r/364231 (https://phabricator.wikimedia.org/T156850) [20:42:24] (03CR) 10Thcipriani: [C: 04-1] Gerrit: Add support for scap (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/363726 (https://phabricator.wikimedia.org/T157414) (owner: 10Paladox) [20:50:06] 10Operations, 10Discovery, 10Wikidata, 10Wikidata-Query-Service, and 3 others: Create response time monitoring for WDQS endpoint - https://phabricator.wikimedia.org/T119915#3422728 (10Smalyshev) @Izno yes, pasted in a wrong window :) [20:53:52] (03Draft1) 10Paladox: Symlink review_site folder inside deployment to /var/lib/gerrit2/review_site [software/gerrit] - 10https://gerrit.wikimedia.org/r/364320 [20:53:54] (03PS2) 10Paladox: Symlink review_site folder inside deployment to /var/lib/gerrit2/review_site 
[software/gerrit] - 10https://gerrit.wikimedia.org/r/364320 [20:55:18] (03PS20) 10Paladox: Gerrit: Add support for scap [puppet] - 10https://gerrit.wikimedia.org/r/363726 (https://phabricator.wikimedia.org/T157414) [20:55:53] (03CR) 10Paladox: Gerrit: Add support for scap (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/363726 (https://phabricator.wikimedia.org/T157414) (owner: 10Paladox) [20:59:43] RECOVERY - nova instance creation test on labnet1001 is OK: PROCS OK: 1 process with command name python, args nova-fullstack [21:00:04] dapatrick, bawolff, and Reedy: Dear anthropoid, the time has come. Please deploy Weekly Security deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170710T2100). [21:00:41] mediawikiwiki seems down or very laggy [21:00:56] ERR_SPDY_PROTOCOL_ERROR [21:01:02] mediawiki.org? [21:01:04] works for me. [21:01:58] I'm trying to put {{TNT|Outdated}} into https://www.mediawiki.org/w/index.php?title=Manual:CleanupTitles.php [21:02:03] hit "show preview" [21:02:07] loads indefinitely [21:02:11] then crashes [21:02:43] PROBLEM - nova instance creation test on labnet1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args nova-fullstack [21:02:49] i get this in the console [21:02:50] Failed to set referrer policy: The value 'origin-when-cross-origin' is not one of 'no-referrer', 'origin', 'no-referrer-when-downgrade', or 'unsafe-url'. Defaulting to 'no-referrer'. [21:02:57] but no crashes and show preview works. [21:03:05] not for me :( [21:03:37] hmm, SPDY_PROTOCOL in there [21:03:56] * TabbyCat googles [21:04:38] makes me think it only happens if browser supports that [21:04:44] doesnt see the error [21:05:30] what browser do you use? 
[21:05:43] i think that protocole is being discontinued and replaced with http/2 [21:05:50] (03CR) 10Thcipriani: Symlink review_site folder inside deployment to /var/lib/gerrit2/review_site (031 comment) [software/gerrit] - 10https://gerrit.wikimedia.org/r/364320 (owner: 10Paladox) [21:06:04] chrome [21:06:12] (03PS3) 10Paladox: Symlink review_site folder inside deployment to /var/lib/gerrit2/review_site [software/gerrit] - 10https://gerrit.wikimedia.org/r/364320 [21:06:14] it worked less than an hour ago [21:06:26] (03CR) 10Paladox: Symlink review_site folder inside deployment to /var/lib/gerrit2/review_site (031 comment) [software/gerrit] - 10https://gerrit.wikimedia.org/r/364320 (owner: 10Paladox) [21:06:42] what version? [21:06:54] !log running IPMI auditing to update status of T150160 [21:06:57] TabbyCat: wanna try if it changes anything to disable it? [21:06:59] "SPDY protocol functionality can be (de)activated by toggling "Enable SPDY/4" setting on local chrome://flags page. " [21:07:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:07:07] T150160: Remote IPMI doesn't work for ~2% of the fleet - https://phabricator.wikimedia.org/T150160 [21:07:43] got that from https://en.wikipedia.org/wiki/SPDY#Client_.28browser.29_support_and_usage [21:08:51] I think I'll just restart the browser and if it still happens I'll do a full computer restart [21:09:21] TabbyCat: try restarting chrome, i've found when you had long running sessions it can go hinky sometimes [21:09:41] espically if chrome is waiting to load updates or whatever on restart [21:09:51] just doing :) [21:11:29] nah, doesn't work [21:13:23] might be this new preview mode? [21:13:43] can't preview on meta -- ajax loading icon loads indefinitely [21:13:52] An error occurred while attempting to preview your changes. [21:13:54] HTTP error: timeout [21:14:00] s-t [21:15:12] (03CR) 10Thcipriani: [C: 04-1] "big long explanation inline." 
(031 comment) [software/gerrit] - 10https://gerrit.wikimedia.org/r/364320 (owner: 10Paladox) [21:17:33] MaxSem: Did the script finish? [21:17:48] it crashed, see above [21:23:16] POST https://www.mediawiki.org/w/index.php?title=Manual:CleanupTitles.php&action=submit net::ERR_SPDY_PROTOCOL_ERROR [21:23:24] sigh [21:25:24] will restart computer, the noob trick usually works [21:26:40] (03Abandoned) 10Paladox: Symlink review_site folder inside deployment to /var/lib/gerrit2/review_site [software/gerrit] - 10https://gerrit.wikimedia.org/r/364320 (owner: 10Paladox) [21:32:53] (03CR) 10jenkins-bot: Stop disabling MFTidyMobileViewSections [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364264 (https://phabricator.wikimedia.org/T168671) (owner: 10Jdlrobson) [21:33:08] works now [21:33:40] (03PS5) 10Paladox: Gerrit: Add gerrit pub key for ssh [labs/private] - 10https://gerrit.wikimedia.org/r/363755 [21:36:09] (03PS21) 10Paladox: Gerrit: Add support for scap [puppet] - 10https://gerrit.wikimedia.org/r/363726 (https://phabricator.wikimedia.org/T157414) [21:37:20] (03PS23) 10Paladox: Gerrit: Add support for scap [puppet] - 10https://gerrit.wikimedia.org/r/363726 (https://phabricator.wikimedia.org/T157414) [21:58:43] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [22:00:59] !log restart varnish backend on cp1099 (mailbox lag) [22:01:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:01:43] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [22:02:13] RECOVERY - Check Varnish expiry mailbox lag on cp1099 is OK: OK: expiry mailbox lag is 0 [22:08:07] !log reedy@tin Synchronized php-1.30.0-wmf.7/includes/: (no justification provided) (duration: 01m 33s) [22:08:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:11:33] (03PS7) 10Paladox: gerrit: DO NOT MERGE 
[software/gerrit] - 10https://gerrit.wikimedia.org/r/363738 [22:11:36] (03PS6) 10Paladox: Gerrit: Upgrading gerrit to 2.14.2-pre (DO NOT MERGE) [software/gerrit] - 10https://gerrit.wikimedia.org/r/363734 [22:12:43] RECOVERY - Upload HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [22:13:00] !log Re-ran cleanupTitles.php on Meta with live fix applied, works now (ref T61837) [22:13:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:13:11] T61837: Remove Participation: and Programs: namespaces from metawiki - https://phabricator.wikimedia.org/T61837 [22:14:28] hmm :) [22:15:49] (03PS8) 10Paladox: gerrit: DO NOT MERGE [software/gerrit] - 10https://gerrit.wikimedia.org/r/363738 [22:15:55] (03PS7) 10Paladox: Gerrit: Upgrading gerrit to 2.14.2-pre (DO NOT MERGE) [software/gerrit] - 10https://gerrit.wikimedia.org/r/363734 [22:23:49] mutante: You have a sec to help me get a new developer up and running? [22:26:34] nevermind, I think we're in good shape [22:28:06] !log bawolff@tin Synchronized php-1.30.0-wmf.7/extensions/CentralAuth/includes/specials/SpecialCentralAutoLogin.php: T134931 (duration: 00m 44s) [22:28:15] !log reloaded apache2 config on iridium to activate the changes from https://gerrit.wikimedia.org/r/#/c/363356/ [22:28:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:28:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:32:42] 10Operations, 10monitoring: Remote IPMI doesn't work for ~2% of the fleet - https://phabricator.wikimedia.org/T150160#3423075 (10faidon) [22:32:46] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: Reset db1070 idrac - https://phabricator.wikimedia.org/T160392#3423073 (10faidon) 05Open>03Resolved OK, so I noticed that the `Error: Unable to establish IPMI v2 / RMCP+ session` response was immediate, like the password was wrong. So I tried changi... 
[22:36:11] 10Operations, 10ops-codfw, 10ops-eqiad, 10monitoring: Unresponsive/misconfigured iDRACs - https://phabricator.wikimedia.org/T169360#3395803 (10Volans) Regarding `sodium` it seems to me that puppet runs get stuck because it try to execute `ipmi-config` while loading facts: ``` 143053 pts/0 S+ 0:00... [22:36:31] 10Operations, 10DBA: dbstore1001 troubleshoot IPMI issue - https://phabricator.wikimedia.org/T158893#3423109 (10faidon) Same issue as T160392. From the iDRAC web interface, I set the password to something random then back to our password and this seems to have done the trick. [22:36:37] 10Operations, 10monitoring: Remote IPMI doesn't work for ~2% of the fleet - https://phabricator.wikimedia.org/T150160#3423113 (10faidon) [22:36:39] 10Operations, 10DBA: dbstore1001 troubleshoot IPMI issue - https://phabricator.wikimedia.org/T158893#3423111 (10faidon) 05Open>03Resolved [22:38:21] 10Operations, 10monitoring: Remote IPMI doesn't work for ~2% of the fleet - https://phabricator.wikimedia.org/T150160#3423123 (10Volans) I've run the audit again with a small script on `neodymium` in my home using `cumin` to grab the list of hostnames. The only requirement is to have exported `IPMI_PASSWORD` w... [22:44:48] 10Operations, 10monitoring: Remote IPMI doesn't work for ~2% of the fleet - https://phabricator.wikimedia.org/T150160#3423143 (10Volans) I've try to fix the 5 hosts with the remote wrong remote config: ``` sudo cumin -b 1 "bast3002.wikimedia.org,cp4021.ulsfo.wmnet,db2082.codfw.wmnet,gerrit2001.wikimedia.org,na... [22:48:27] kaldari: back, what do you need [22:50:59] ok, saw the second line later :) [23:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170710T2300). Please do the needful. 
[23:00:04] Jdlrobson: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [23:00:13] here here here [23:00:42] I added one too. [23:01:25] I can SWAT [23:02:33] (03CR) 10Thcipriani: [C: 032] Compress srlogo for Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364269 (https://phabricator.wikimedia.org/T165896) (owner: 10Jdlrobson) [23:03:28] (03Merged) 10jenkins-bot: Compress srlogo for Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364269 (https://phabricator.wikimedia.org/T165896) (owner: 10Jdlrobson) [23:03:40] (03CR) 10jenkins-bot: Compress srlogo for Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364269 (https://phabricator.wikimedia.org/T165896) (owner: 10Jdlrobson) [23:04:42] jdlrobson: live on mwdebug1002, check please [23:04:53] thcipriani: checking [23:06:36] thcipriani: sync please! [23:06:56] jdlrobson: going [23:08:42] !log thcipriani@tin Synchronized static/images/mobile/copyright/wikipedia-wordmark-sr.svg: SWAT [[gerrit:364269|Compress srlogo for Wikipedia]] T165896 (duration: 00m 43s) [23:08:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:08:56] T165896: Logo for sr.m.wikipedia.org - https://phabricator.wikimedia.org/T165896 [23:09:21] 10Operations, 10Electron-PDFs, 10Patch-For-Review, 10Reading-Web-Backlog (Tracking), 10Services (blocked): pdfrender fails to serve requests since Mar 8 00:30:32 UTC on scb1003 - https://phabricator.wikimedia.org/T159922#3423200 (10mobrovac) [23:09:48] ^ jdlrobson live now [23:09:54] woot [23:10:42] 10Operations, 10Epic, 10Goal, 10Services (doing): End of August milestone: Cassandra 3 cluster in production - https://phabricator.wikimedia.org/T169939#3423216 (10mobrovac) [23:10:53] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [23:10:53] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is 
CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [23:11:40] thanks thcipriani [23:12:43] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [23:13:22] anyone know what this change is to maintenance/cleanupTitles.php? [23:13:54] (03PS1) 10Faidon Liambotis: Fix stat1003's mgmt IP [dns] - 10https://gerrit.wikimedia.org/r/364338 [23:14:11] (03CR) 10Faidon Liambotis: [C: 032] Fix stat1003's mgmt IP [dns] - 10https://gerrit.wikimedia.org/r/364338 (owner: 10Faidon Liambotis) [23:14:33] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [23:15:17] well. Seems to only exist on tin afaict, I'm going to revert it. [23:15:45] thcipriani: I think MaxSem was messing with that earlier [23:15:50] 10Operations, 10AbuseFilter, 10Traffic, 10Zero: user_wpzero doesn't always work - https://phabricator.wikimedia.org/T169907#3423257 (10Legoktm) The code is as simple as: ```lang=php $vars->setVar( 'user_wpzero', $wgRequest->getHeader( 'X-Carrier' ) !== false ); ``` So maybe that header isn't alw... [23:15:59] yep [23:16:23] MaxSem: blerg, I checked out the file already, but the change is in my backscroll if you need it for any reason [23:16:38] nah, I already committed it:) [23:16:48] cool :) [23:17:42] matt_flaschen: flow change live on mwdebug1002, check please [23:18:33] thcipriani, will do. Can I add a late-breaking change as well? 
[23:18:48] oh boy [23:18:51] sure [23:18:53] RECOVERY - Upload HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [23:19:33] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [23:19:53] RECOVERY - Codfw HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [23:20:07] (03CR) 10Paladox: [C: 031] Remove 'din' from wmgExtraLanguageNames [mediawiki-config] - 10https://gerrit.wikimedia.org/r/362876 (https://phabricator.wikimedia.org/T168523) (owner: 10D3r1ck01) [23:20:53] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [23:23:39] thcipriani, the one on mwdebug1002 is tested and good to go. [23:23:46] matt_flaschen: ok, going live [23:24:14] thcipriani, never mind on the late-breaking one. [23:24:23] Thanks for your flexibility though. [23:25:09] sure :) [23:25:45] !log thcipriani@tin Synchronized php-1.30.0-wmf.7/extensions/Flow/Hooks.php: SWAT: [[gerrit:364337|Do not override other flags on enhanced recent changes]] T169181 (duration: 00m 42s) [23:25:53] ^ matt_flaschen flow change live [23:25:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:25:56] T169181: Edits to Flow are marked as Wikidata edits in enhanced RC - https://phabricator.wikimedia.org/T169181 [23:27:07] 10Operations, 10monitoring: Remote IPMI doesn't work for ~2% of the fleet - https://phabricator.wikimedia.org/T150160#3423316 (10faidon) So I did the following: - mw1302: had `Volatile_Channel_Privilege_Limit` and `Non_Volatile_Channel_Privilege_Limit` set to `Operator` instead of `Administrator`; fixed with b... 
[23:33:57] 10Operations, 10monitoring: Remote IPMI doesn't work for ~2% of the fleet - https://phabricator.wikimedia.org/T150160#3423327 (10faidon) [23:37:23] PROBLEM - Apache HTTP on mw2130 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:38:13] RECOVERY - Apache HTTP on mw2130 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.141 second response time [23:38:54] 10Operations, 10monitoring: Remote IPMI doesn't work for ~2% of the fleet - https://phabricator.wikimedia.org/T150160#3423333 (10faidon) [23:41:10] (03PS1) 10Faidon Liambotis: ipmi: add a 5 second timeout to fact [puppet] - 10https://gerrit.wikimedia.org/r/364341 [23:41:40] (03PS2) 10Faidon Liambotis: ipmi: add a 5 second timeout to the ipmi_lan fact [puppet] - 10https://gerrit.wikimedia.org/r/364341 [23:42:25] (03CR) 10Faidon Liambotis: [C: 032] ipmi: add a 5 second timeout to the ipmi_lan fact [puppet] - 10https://gerrit.wikimedia.org/r/364341 (owner: 10Faidon Liambotis) [23:43:21] 10Operations, 10AbuseFilter, 10Traffic, 10Zero: user_wpzero doesn't always work - https://phabricator.wikimedia.org/T169907#3423355 (10Legoktm) 05Open>03Invalid I looked up the IPs the users used in the database and according to @bawolff, they're not Wikipedia Zero IP's. [23:45:53] PROBLEM - puppet last run on sodium is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [23:46:26] (03PS1) 10Faidon Liambotis: Revert "ipmi: add a 5 second timeout to the ipmi_lan fact" [puppet] - 10https://gerrit.wikimedia.org/r/364343 [23:46:36] (03CR) 10Faidon Liambotis: [V: 032 C: 032] Revert "ipmi: add a 5 second timeout to the ipmi_lan fact" [puppet] - 10https://gerrit.wikimedia.org/r/364343 (owner: 10Faidon Liambotis) [23:54:34] 10Operations, 10ops-codfw, 10ops-eqiad, 10monitoring: Unresponsive/misconfigured iDRACs - https://phabricator.wikimedia.org/T169360#3423369 (10Dzahn) [23:57:03] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] [23:58:43] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0]