[00:57:20] Hey all, I have a question related to Grafana dashboards. Looks like https://grafana.wikimedia.org/dashboard/db/mobile-2g?orgId=1&panelId=32&fullscreen stopped working on 2017-08-08, is there anything I'm not aware of?
[00:57:31] or is it just that the dates on the graphs are incorrect?
[01:21:05] Operations, Documentation: Improve SSH access information in onboarding documentation - https://phabricator.wikimedia.org/T160941#3540046 (Tbayer) CCing @diego and @RobH who (judging from [[http://wm-bot.wmflabs.org/logs/%23wikimedia-analytics/20170818.txt |IRC scrollback]]) grappled quite a bit too with...
[02:23:52] !log l10nupdate@tin scap sync-l10n completed (1.30.0-wmf.14) (duration: 07m 11s)
[02:24:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:30:47] !log l10nupdate@tin ResourceLoader cache refresh completed at Tue Aug 22 02:30:47 UTC 2017 (duration 6m 55s)
[02:30:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:29:08] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 654.31 seconds
[03:30:38] PROBLEM - Check Varnish expiry mailbox lag on cp4015 is CRITICAL: CRITICAL: expiry mailbox lag is 2055705
[03:42:34] Operations, ops-codfw: mw2256 - hardware issue - https://phabricator.wikimedia.org/T163346#3540153 (Papaul) I will be receiving a replacement main board for this system this Wednesday. Your Service Request SR#: 952124690 Contact Us | Support Library | Download Center | SupportAssist | Community Forums...
[04:12:38] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 255.21 seconds
[05:22:48] (CR) Phedenskog: [C: 1] "@Ottomata yes please" [puppet] - https://gerrit.wikimedia.org/r/372577 (https://phabricator.wikimedia.org/T104902) (owner: Krinkle)
[05:28:57] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 32 probes of 269 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[05:33:58] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 7 probes of 269 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[06:44:50] (PS1) Marostegui: s4.hosts: Add db1102 [software] - https://gerrit.wikimedia.org/r/372956 (https://phabricator.wikimedia.org/T172996)
[06:47:59] (CR) Marostegui: [C: 2] s4.hosts: Add db1102 [software] - https://gerrit.wikimedia.org/r/372956 (https://phabricator.wikimedia.org/T172996) (owner: Marostegui)
[06:48:44] (Merged) jenkins-bot: s4.hosts: Add db1102 [software] - https://gerrit.wikimedia.org/r/372956 (https://phabricator.wikimedia.org/T172996) (owner: Marostegui)
[06:54:34] !log installing graphite2 (the image library, not the metrics tool) security updates on trusty (Debian already fixed)
[06:54:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:11:07] !log installing c-ares security updates on trusty (Debian already fixed)
[07:11:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:26:44] (PS1) Marostegui: db-eqiad.php: Depool db1064 [mediawiki-config] - https://gerrit.wikimedia.org/r/373016 (https://phabricator.wikimedia.org/T172996)
[07:26:57] (PS2) Marostegui: db-eqiad.php: Depool db1064 [mediawiki-config] - https://gerrit.wikimedia.org/r/373016 (https://phabricator.wikimedia.org/T172996)
[07:28:08] PROBLEM - MariaDB Slave Lag: s7 on dbstore2001 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 34282.20 seconds
[07:28:17] PROBLEM - MariaDB Slave Lag: s4 on dbstore2001 is CRITICAL: CRITICAL slave_sql_lag could not connect
[07:28:28] PROBLEM - MariaDB Slave SQL: s4 on dbstore2001 is CRITICAL: CRITICAL slave_sql_state could not connect
[07:28:33] !log installing ruby security updates on trusty (Debian already fixed)
[07:28:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:28:43] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1064 [mediawiki-config] - https://gerrit.wikimedia.org/r/373016 (https://phabricator.wikimedia.org/T172996) (owner: Marostegui)
[07:28:47] PROBLEM - mysqld processes on dbstore2001 is CRITICAL: PROCS CRITICAL: 7 processes with command name mysqld
[07:28:57] PROBLEM - MariaDB Slave IO: s4 on dbstore2001 is CRITICAL: CRITICAL slave_io_state could not connect
[07:29:10] jynus: ^ that is you?
[07:29:23] downtimes expired
[07:29:28] aaaah
[07:29:29] right
[07:29:37] but the problem is that s7 cannot recover
[07:29:44] even with s4 stopped
[07:30:10] (Merged) jenkins-bot: db-eqiad.php: Depool db1064 [mediawiki-config] - https://gerrit.wikimedia.org/r/373016 (https://phabricator.wikimedia.org/T172996) (owner: Marostegui)
[07:30:35] (CR) jenkins-bot: db-eqiad.php: Depool db1064 [mediawiki-config] - https://gerrit.wikimedia.org/r/373016 (https://phabricator.wikimedia.org/T172996) (owner: Marostegui)
[07:30:46] should I delete s4?
[07:31:24] Oh :(
[07:31:34] Yeah, let's get rid of it, at least it is on dbstore2002
[07:32:14] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1064 - T172996 (duration: 00m 52s)
[07:32:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:32:25] T172996: Migrate s4 from db1095 to db1102 - https://phabricator.wikimedia.org/T172996
[07:34:15] !log Stop replication on db1064 sanitarium2 and sanitarium3 master to move labsdb1009,10 and 11 s4 from db1095 to db1102 - https://phabricator.wikimedia.org/T172996
[07:34:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:58:40] (PS2) Jcrespo: mariadb: Remove custom salt grains due to salt deprecation [puppet] - https://gerrit.wikimedia.org/r/370993 (https://phabricator.wikimedia.org/T164780)
[07:59:17] (CR) Jcrespo: [C: 2] mariadb: Remove custom salt grains due to salt deprecation [puppet] - https://gerrit.wikimedia.org/r/370993 (https://phabricator.wikimedia.org/T164780) (owner: Jcrespo)
[08:00:07] Operations, ops-eqiad: Degraded RAID on logstash1006 - https://phabricator.wikimedia.org/T173679#3540322 (Volans) p:Triage>Normal
[08:01:04] Operations, ops-eqiad, Analytics-Kanban: Degraded RAID on analytics1055 - https://phabricator.wikimedia.org/T172809#3540324 (Volans) p:Triage>Normal
[08:10:30] Operations, MediaWiki-extensions-Scribunto: Build and push a new hhvm-luasandbox package - https://phabricator.wikimedia.org/T171166#3540325 (MoritzMuehlenhoff) The migrated canary servers are looking fine. I was initially irritated by https://phabricator.wikimedia.org/T173705, but that also occurs indep...
[08:13:47] (PS2) Alexandros Kosiaris: ores: More configs for stress testing [puppet] - https://gerrit.wikimedia.org/r/372866 (https://phabricator.wikimedia.org/T169246) (owner: Ladsgroup)
[08:13:53] (CR) Alexandros Kosiaris: [V: 2 C: 2] ores: More configs for stress testing [puppet] - https://gerrit.wikimedia.org/r/372866 (https://phabricator.wikimedia.org/T169246) (owner: Ladsgroup)
[08:23:56] !log upgrading hhvm-luasandbox on deployment servers / script runners
[08:24:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:24:53] !log Drop s4 from db1095 with NO replication - T172996
[08:25:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:25:04] T172996: Migrate s4 from db1095 to db1102 - https://phabricator.wikimedia.org/T172996
[08:28:28] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[08:32:34] looks like cp4015 is in trouble with its mailbox, I'll bounce varnish
[08:32:42] (CR) KartikMistry: "recheck" [debs/contenttranslation/apertium-bel] - https://gerrit.wikimedia.org/r/372227 (https://phabricator.wikimedia.org/T172381) (owner: KartikMistry)
[08:32:50] (CR) KartikMistry: "recheck" [debs/contenttranslation/apertium-rus] - https://gerrit.wikimedia.org/r/372230 (https://phabricator.wikimedia.org/T172381) (owner: KartikMistry)
[08:33:09] !log bounce varnish on cp4015 - mailbox problems
[08:33:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:34:36] and cp1072 too
[08:35:06] !log bounce varnish on cp1072 - mailbox problems
[08:35:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:36:38] Operations, monitoring, Patch-For-Review: Fix Icinga checks for test/decom servers - https://phabricator.wikimedia.org/T151632#3540348 (akosiaris) I think we should clarify a bit better what we want to do so that we all are on the same page. So here's a couple of questions to help with that. * Do we...
[08:37:21] (CR) Alexandros Kosiaris: [C: 2] apertium-bel: Initial Debian packaging [debs/contenttranslation/apertium-bel] - https://gerrit.wikimedia.org/r/372227 (https://phabricator.wikimedia.org/T172381) (owner: KartikMistry)
[08:37:40] (CR) Alexandros Kosiaris: [C: 2] apertium-rus: Initial Debian packaging [debs/contenttranslation/apertium-rus] - https://gerrit.wikimedia.org/r/372230 (https://phabricator.wikimedia.org/T172381) (owner: KartikMistry)
[08:39:08] (PS1) Filippo Giunchedi: statsite: don't track statsd udp traffic [puppet] - https://gerrit.wikimedia.org/r/373032 (https://phabricator.wikimedia.org/T173731)
[08:40:00] akosiaris: feel free to upload and recheck apertium-bel-rus when you've time :) I'll be back after some time.
[08:40:09] οκ
[08:40:35] I'll update puppet config today too.
[08:40:47] RECOVERY - Check Varnish expiry mailbox lag on cp4015 is OK: OK: expiry mailbox lag is 0
[08:41:27] PROBLEM - HHVM jobrunner on mw1260 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time
[08:42:07] the 500s have recovered at 8:36 btw, the alerts will recover too at some point
[08:42:27] RECOVERY - HHVM jobrunner on mw1260 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.002 second response time
[08:43:25] !log upload apertium-bel_0.1.0~r81357-1+wmf to apt.wikimedia.org/jessie-wikimedia/main
[08:43:25] !log upload apertium-rus_0.1.0~r81184-1+wmf to apt.wikimedia.org/jessie-wikimedia/main
[08:43:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:43:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:43:55] (CR) Alexandros Kosiaris: "recheck" [debs/contenttranslation/apertium-bel-rus] - https://gerrit.wikimedia.org/r/372341 (https://phabricator.wikimedia.org/T172381) (owner: KartikMistry)
[08:44:07] (PS2) Alexandros Kosiaris: Introduce ganeti100[789].eqiad.wmnet [dns] - https://gerrit.wikimedia.org/r/372857 (https://phabricator.wikimedia.org/T173565)
[08:44:21] (CR) Alexandros Kosiaris: [C: 2] Introduce ganeti100[789].eqiad.wmnet [dns] - https://gerrit.wikimedia.org/r/372857 (https://phabricator.wikimedia.org/T173565) (owner: Alexandros Kosiaris)
[08:47:38] RECOVERY - Upload HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[08:53:15] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1064" [mediawiki-config] - https://gerrit.wikimedia.org/r/373033
[08:54:03] Operations, Wikimedia-Logstash, vm-requests, Discovery-Search (Current work), Patch-For-Review: Provision VMs on Ganeti for logstash100[123] - https://phabricator.wikimedia.org/T173565#3540370 (akosiaris) @RobH, I 've just done the DNS part and then remembered that on T173298#3531206 you 've...
[08:54:25] Operations, Wikimedia-Logstash, vm-requests, Discovery-Search (Current work), Patch-For-Review: Provision VMs on Ganeti for logstash100[123] - https://phabricator.wikimedia.org/T173565#3540387 (akosiaris) p:Triage>Normal a:akosiaris>RobH
[08:55:45] !log upgrading hhvm-luasandbox on mw1161-mw1167 (job runners)
[08:55:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:56:11] (CR) Alexandros Kosiaris: [C: 2] apertium-bel-rus: Initial Debian packaging [debs/contenttranslation/apertium-bel-rus] - https://gerrit.wikimedia.org/r/372341 (https://phabricator.wikimedia.org/T172381) (owner: KartikMistry)
[08:56:27] PROBLEM - HHVM jobrunner on mw1162 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time
[08:56:55] Operations, DBA, MediaWiki-extensions-WikibaseClient, Wikidata, and 5 others: Cache invalidations coming from the JobQueue are causing lag on several wikis - https://phabricator.wikimedia.org/T164173#3540394 (Marostegui) Maybe we should consider this fixed? it has not happened again since Thursda...
[08:57:27] RECOVERY - HHVM jobrunner on mw1162 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.002 second response time
[08:58:14] Operations, Mail, OTRS, Patch-For-Review: Automatically merge bounces/DSNs in ticket - https://phabricator.wikimedia.org/T173733#3540395 (akosiaris)
[08:58:40] Operations, DBA, MediaWiki-extensions-WikibaseClient, Wikidata, and 5 others: Cache invalidations coming from the JobQueue are causing lag on several wikis - https://phabricator.wikimedia.org/T164173#3540397 (Ladsgroup) Open>Resolved
[08:58:58] Operations, DBA, MediaWiki-extensions-WikibaseClient, Wikidata, and 5 others: Cache invalidations coming from the JobQueue are causing lag on several wikis - https://phabricator.wikimedia.org/T164173#3276584 (Ladsgroup) Thanks! Feel free to reopen in case it started to happen again.
[08:59:41] !log upload apertium-bel-rus_0.2.0~r81186-1+wmf to apt.wikimedia.org/jessie-wikimedia/main
[08:59:44] kart_: done
[08:59:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:00:12] Operations, DBA, Wikidata, Wikidata.org: Wikidata.org currently very slow - https://phabricator.wikimedia.org/T173269#3540402 (Marostegui) This has not happened since T164173 got fixed I believe, so maybe it was indeed a direct cause.
[09:03:29] Operations, DBA, Wikidata, Wikidata.org: Wikidata.org currently very slow - https://phabricator.wikimedia.org/T173269#3540404 (jcrespo) Open>Resolved a:jcrespo Resolving for now.
[09:04:01] Operations, DBA, Wikidata, Wikidata.org: Wikidata.org currently very slow - https://phabricator.wikimedia.org/T173269#3540408 (jcrespo) a:jcrespo>daniel
[09:06:51] Operations, monitoring, Patch-For-Review: Fix Icinga checks for test/decom servers - https://phabricator.wikimedia.org/T151632#3540414 (Volans) My answers to the above questions are: YES, YES, YES (but I'd like them to be separated in the UI, unfortunately this is not possible in Icinga), NO For dec...
[09:12:49] Operations, monitoring, Patch-For-Review: Fix Icinga checks for test/decom servers - https://phabricator.wikimedia.org/T151632#2823217 (jcrespo) Question- could a similar solution be applied to "in installation" hosts? Databases can take a day or more to provision, but they need the full role to be a...
[09:17:48] (CR) Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/compiler02/7558" [puppet] - https://gerrit.wikimedia.org/r/373032 (https://phabricator.wikimedia.org/T173731) (owner: Filippo Giunchedi)
[09:17:49] (CR) Filippo Giunchedi: [C: 2] statsite: don't track statsd udp traffic [puppet] - https://gerrit.wikimedia.org/r/373032 (https://phabricator.wikimedia.org/T173731) (owner: Filippo Giunchedi)
[09:17:56] (PS2) Filippo Giunchedi: statsite: don't track statsd udp traffic [puppet] - https://gerrit.wikimedia.org/r/373032 (https://phabricator.wikimedia.org/T173731)
[09:18:17] RECOVERY - mysqld processes on dbstore2001 is OK: PROCS OK: 8 processes with command name mysqld
[09:20:23] Operations, monitoring, Patch-For-Review: Fix Icinga checks for test/decom servers - https://phabricator.wikimedia.org/T151632#3540455 (akosiaris) @jcrespo, assuming I have understood correctly what you want, yes I think so.
[09:23:22] Operations, monitoring, Patch-For-Review: Fix Icinga checks for test/decom servers - https://phabricator.wikimedia.org/T151632#3540457 (Volans) I think so too but it might need some parameter or hiera value to define those as "provisioning", given that they will have already the production MariaDB ro...
[09:32:22] (CR) Marostegui: [C: 2] Revert "db-eqiad.php: Depool db1064" [mediawiki-config] - https://gerrit.wikimedia.org/r/373033 (owner: Marostegui)
[09:33:47] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1064" [mediawiki-config] - https://gerrit.wikimedia.org/r/373033 (owner: Marostegui)
[09:34:44] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1064 - T172996 (duration: 00m 44s)
[09:34:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:34:59] T172996: Migrate s4 from db1095 to db1102 - https://phabricator.wikimedia.org/T172996
[09:35:17] !log upgrading hhvm-luasandbox on mw1180-1188 and mw1209-mw1220 (app servers)
[09:35:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:36:11] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1064" [mediawiki-config] - https://gerrit.wikimedia.org/r/373033 (owner: Marostegui)
[09:42:57] !log Renaming user Darwinius → DarwIn - T173159
[09:43:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:43:08] T173159: Global rename of Darwinius → DarwIn: supervision needed - https://phabricator.wikimedia.org/T173159
[09:49:27] RECOVERY - MariaDB Slave SQL: s4 on dbstore2001 is OK: OK slave_sql_state Slave_SQL_Running: Yes
[09:49:47] RECOVERY - MariaDB Slave IO: s4 on dbstore2001 is OK: OK slave_io_state Slave_IO_Running: Yes
[09:52:12] (PS1) Filippo Giunchedi: ferm: introduce ferm::client [puppet] - https://gerrit.wikimedia.org/r/373038 (https://phabricator.wikimedia.org/T173731)
[09:52:14] (PS1) Filippo Giunchedi: swift: don't track client connections in frontend [puppet] - https://gerrit.wikimedia.org/r/373039 (https://phabricator.wikimedia.org/T173731)
[09:54:20] !log upgrading hhvm-luasandbox on mw1189-mw1208 (API servers)
[09:54:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:17:58] PROBLEM - MariaDB Slave Lag: s2 on dbstore2001 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 464.95 seconds
[10:21:38] !log another run of rebuildTermSqlIndex (T171460)
[10:21:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:21:50] T171460: Populate term_full_entity_id on www.wikidata.org - https://phabricator.wikimedia.org/T171460
[10:24:20] (PS18) Matthias Mullie: Add 3d2png deploy repo to image scalers [puppet] - https://gerrit.wikimedia.org/r/345377 (https://phabricator.wikimedia.org/T160185) (owner: MarkTraceur)
[10:26:29] (PS1) Phuedx: relatedArticles: Tidy up config [mediawiki-config] - https://gerrit.wikimedia.org/r/373043 (https://phabricator.wikimedia.org/T165991)
[10:27:47] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1957 bytes in 0.305 second response time
[10:30:27] I tried to ack. it in icinga.wikimedia.org but I couldn't. Can someone do it for six hours?
[10:31:05] Amir1: I will do it
[10:31:13] Thank you!
[10:31:57] Amir1: Done for 7 hours, to give you some more room
[10:32:20] That's great. Thank you
[10:32:47] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1934 bytes in 0.136 second response time
[10:36:27] Operations, ops-eqiad, DBA: RAID crashed on db1078 - https://phabricator.wikimedia.org/T173365#3540652 (Marostegui) I would like to propose db1076 (s2) as a candidate host to do the test once db1078 is back in the pool with the new disk. db1076 belongs to s2 and there are two more powerful hosts ther...
[10:37:17] RECOVERY - MariaDB Slave Lag: s2 on dbstore2001 is OK: OK slave_sql_lag Replication lag: 22.97 seconds
[10:40:29] Operations, monitoring, netops, Patch-For-Review, User-fgiunchedi: Evaluate LibreNMS' Graphite backend - https://phabricator.wikimedia.org/T171167#3540668 (faidon) I was looking for PDU power usage metrics. Since we don't have a Grafana dashboard yet, I tried to query Graphite manually with e...
[10:41:51] !log upgrading hhvm-luasandbox on mw1293-mw1295 (image scalers)
[10:42:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:43:04] Operations, MediaWiki-extensions-Scribunto: Build and push a new hhvm-luasandbox package - https://phabricator.wikimedia.org/T171166#3540688 (MoritzMuehlenhoff) The following hosts were upgraded to 2.0.13 (and HHVM restarted): mw1161-mw1167 (job runners) mw1180-mw1188 (app servers) mw1209-mw1220 (app se...
[10:43:35] Operations, monitoring, netops, Patch-For-Review, User-fgiunchedi: Evaluate LibreNMS' Graphite backend - https://phabricator.wikimedia.org/T171167#3456248 (Volans) At first sight it might just be that the update frequency of the data and the smallest retention period set in graphite do not ma...
[11:00:48] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 9 probes of 270 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map
[11:04:07] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 56 probes of 269 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[11:06:27] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 52 probes of 270 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[11:07:11] Operations, monitoring: Evaluate Grafana's LDAP group options and deprecate grafana-admin if possible - https://phabricator.wikimedia.org/T170150#3540712 (akosiaris) Some preliminary results: = Authn = We can use grafana's LDAP authentication, albeit it has some caveats that are related to our current...
[11:13:09] I can't connect to any host
[11:13:42] I use codfw
[11:15:07] PROBLEM - mobileapps endpoints health on scb2006 is CRITICAL: /{domain}/v1/feed/onthisday/{type}/{mm}/{dd} (retrieve all events on January 15) is CRITICAL: Test retrieve all events on January 15 returned the unexpected status 400 (expecting: 200)
[11:15:47] esams is okay
[11:16:07] RECOVERY - mobileapps endpoints health on scb2006 is OK: All endpoints are healthy
[11:19:07] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 17 probes of 268 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[11:21:27] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 7 probes of 269 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[11:25:38] Operations, ops-eqiad, DBA: RAID crashed on db1078 - https://phabricator.wikimedia.org/T173365#3540766 (Cmjohnson) The disk was finally sent. HP added another report they wanted in addition to the AHS log. That report would have required powering the server off which is ridiculous for a failed disk....
[11:27:21] Operations, ops-eqiad, DBA: RAID crashed on db1078 - https://phabricator.wikimedia.org/T173365#3540772 (Marostegui) >>! In T173365#3540766, @Cmjohnson wrote: > The disk was finally sent. HP added another report they wanted in addition > to the AHS log. That report would have required powering the se...
[12:09:17] PROBLEM - mobileapps endpoints health on scb2006 is CRITICAL: /{domain}/v1/feed/onthisday/{type}/{mm}/{dd} (retrieve the selected anniversaries for January 15) is CRITICAL: Test retrieve the selected anniversaries for January 15 returned the unexpected status 400 (expecting: 200)
[12:11:18] RECOVERY - mobileapps endpoints health on scb2006 is OK: All endpoints are healthy
[12:19:27] PROBLEM - mobileapps endpoints health on scb2006 is CRITICAL: /{domain}/v1/page/mobile-sections-lead/{title} (retrieve lead section of en.wp Altrincham page via mobile-sections-lead) is CRITICAL: Test retrieve lead section of en.wp Altrincham page via mobile-sections-lead returned the unexpected status 503 (expecting: 200)
[12:21:28] RECOVERY - mobileapps endpoints health on scb2006 is OK: All endpoints are healthy
[12:24:37] PROBLEM - mobileapps endpoints health on scb2006 is CRITICAL: /{domain}/v1/page/mobile-sections/{title} (retrieve en.wp main page via mobile-sections) is CRITICAL: Test retrieve en.wp main page via mobile-sections returned the unexpected status 503 (expecting: 200)
[12:25:37] RECOVERY - mobileapps endpoints health on scb2006 is OK: All endpoints are healthy
[12:33:47] PROBLEM - mobileapps endpoints health on scb2006 is CRITICAL: /{domain}/v1/page/mobile-sections/{title} (retrieve en.wp main page via mobile-sections) is CRITICAL: Test retrieve en.wp main page via mobile-sections returned the unexpected status 503 (expecting: 200)
[12:34:47] RECOVERY - mobileapps endpoints health on scb2006 is OK: All endpoints are healthy
[12:36:43] (PS1) ArielGlenn: central auth global blocks tabke dump fixes [puppet] - https://gerrit.wikimedia.org/r/373059
[12:36:59] (CR) jerkins-bot: [V: -1] central auth global blocks tabke dump fixes [puppet] - https://gerrit.wikimedia.org/r/373059 (owner: ArielGlenn)
[12:37:30] (PS2) ArielGlenn: central auth global blocks tabke dump fixes [puppet] - https://gerrit.wikimedia.org/r/373059
[12:41:54] (CR) ArielGlenn: [C: 2] central auth global blocks tabke dump fixes [puppet] - https://gerrit.wikimedia.org/r/373059 (owner: ArielGlenn)
[12:49:13] (PS1) Filippo Giunchedi: prometheus: add blackbox configuration for prometheus::ops [puppet] - https://gerrit.wikimedia.org/r/373062 (https://phabricator.wikimedia.org/T169860)
[12:49:27] PROBLEM - MariaDB Slave Lag: s1 on dbstore2001 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 433.26 seconds
[12:55:21] jouncebot: next
[12:55:21] In 0 hour(s) and 4 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170822T1300)
[12:55:28] RECOVERY - MariaDB Slave Lag: s1 on dbstore2001 is OK: OK slave_sql_lag Replication lag: 0.00 seconds
[12:58:06] CI slightly busy, but it should be fine for swat
[12:58:37] (PS2) Zfilipin: Set $wmgUseWikimediaShopLink to true for ptwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/372874 (https://phabricator.wikimedia.org/T173768) (owner: Urbanecm)
[13:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Dear anthropoid, the time has come. Please deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170822T1300).
[13:00:04] Urbanecm and phuedx: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be available during the process.
[13:00:20] I can SWAT today!
[13:00:20] here
[13:01:13] Urbanecm: merging, will ping you in a minute when the commit is at mwdebug
[13:01:29] (CR) Zfilipin: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/372874 (https://phabricator.wikimedia.org/T173768) (owner: Urbanecm)
[13:01:32] ack
[13:02:55] (Merged) jenkins-bot: Set $wmgUseWikimediaShopLink to true for ptwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/372874 (https://phabricator.wikimedia.org/T173768) (owner: Urbanecm)
[13:03:00] o/
[13:03:09] (CR) jenkins-bot: Set $wmgUseWikimediaShopLink to true for ptwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/372874 (https://phabricator.wikimedia.org/T173768) (owner: Urbanecm)
[13:03:21] (CR) Filippo Giunchedi: "PCC says yes https://puppet-compiler.wmflabs.org/compiler02/7559/" [puppet] - https://gerrit.wikimedia.org/r/373062 (https://phabricator.wikimedia.org/T169860) (owner: Filippo Giunchedi)
[13:05:00] phuedx: you are next, deploying one patch for Urbanecm, should be done in a few minutes
[13:05:14] (PS1) KartikMistry: apertium: Added apertium bel-rus packages and sorted list [puppet] - https://gerrit.wikimedia.org/r/373065 (https://phabricator.wikimedia.org/T172381)
[13:05:21] zeljkof: sure, thanks!
[13:05:25] got my testing urls all worked out
[13:05:40] Urbanecm: patch at mwdebug
[13:05:50] ack
[13:05:57] this one even I should be able to test, looking o.O
[13:07:34] zeljkof, is it at mwdebug1002? Can't see anything like "Wikipedia store"
[13:08:01] Urbanecm: I'm also looking, don't see it, let me double check that I have deployed
[13:09:05] I can see it in Colaboração but according to enwiki, it should be a little bit higher
[13:09:13] merged the commit, it's on tin, rebased it, `scap pull` on mwdebug1002
[13:09:48] Loja da Wikipédia
[13:09:56] it links to https://store.wikimedia.org/
[13:10:07] Jup. But this isn't mine commit, try it w/o mwdebug ;)
[13:10:15] *my
[13:11:21] yes, the link is there with or without x-wikimedia-debug
[13:11:22] It was added by https://pt.wikipedia.org/w/index.php?title=MediaWiki:Sidebar&diff=49638049&oldid=23888656
[13:11:49] so that link should be somewhere else?
[13:12:46] Yes. Look at enwiki
[13:12:49] at enwiki it's in the top section of sidebar, that's where it should be?
[13:13:02] Yes
[13:13:17] In fact there should be two shop links.
[13:14:28] yes, the one hard-coded, and the other from the setting
[13:14:37] not sure what to do
[13:14:38] Exactly
[13:14:56] I think this is some kind of cache maybe.
[13:15:02] maybe the hard coded one messes up the setting one?
[13:15:10] Don't think so.
[13:15:28] Can you have a look what is in $wgHooks after applying the patch? Don't know if that's possible
[13:15:47] Maybe Dereckson or another deployer can help?
[13:17:10] addshore or somebody, can you help out with a simple thing? :)
[13:17:17] Urbanecm and I are lost
[13:17:35] Urbanecm: how do I check $wgHooks?
[13:18:10] As I said, I don't know if that's possible, looking at variables at mwdebug...
[13:18:23] And if I don't know if that's possible, I can't know how to do it :D
[13:18:50] akosiaris: thanks. https://gerrit.wikimedia.org/r/#/c/373065/ when you've time.
[13:18:52] thcipriani|afk: if you have a minute... not sure how to check if https://gerrit.wikimedia.org/r/#/c/372874/ is deployed correctly
[13:19:10] (PS6) Filippo Giunchedi: rsyslog: add support to receive syslog over TLS [puppet] - https://gerrit.wikimedia.org/r/369950 (https://phabricator.wikimedia.org/T136312)
[13:19:37] phuedx: if you can help... ^
[13:20:20] zeljkof: it doesn't seem deployed here to me: https://noc.wikimedia.org/conf/highlight.php?file=InitialiseSettings.php
[13:20:58] volans: thanks, not sure what I did wrong, it should be only on mwdebug1002 so far
[13:21:06] not deployed everywhere yet
[13:21:09] ah ok
[13:21:14] then it's normal that it's not there
[13:21:15] zeljkof: ah, I was going to say, not on terbium either :)
[13:21:36] thcipriani|afk, aren't we deploying from tin? Just thinking, zeljkof said it is on tin
[13:21:44] the commit is merged, I have rebased it on tin, and scap pull on mwdebug1002
[13:22:10] Urbanecm: it's on tin and mwdebug1002 as far as I understand things
[13:22:17] nowhere else yet
[13:22:26] mwrepl ptwiki; var_dump($wmgUseWikimediaShopLink); is bool(true) on mwdebug1002, so that seems correct :)
[13:23:13] thcipriani|afk: ah, thanks, have to note that
[13:23:24] we can't see it at https://pt.wikipedia.org/wiki/Wikip%C3%A9dia:P%C3%A1gina_principal
[13:23:48] according to https://en.wikipedia.org/wiki/Main_Page it should be the last link in the top-left section (sidebar)
[13:24:28] Urbanecm: ok, not sure what to do, revert? deploy?
[13:24:43] hrm, that...I am not sure about. There may be a step missing if it doesn't show up :\
[13:24:44] looks like nothing is broken...
[13:25:16] thcipriani|afk, the wmg variable just modifies wgHooks. There should be no other step
[13:25:52] zeljkof, I think you can deploy it, maybe it will show after some time. As you said, nothing seems to be broken :)
[13:25:57] PROBLEM - MariaDB Slave Lag: s1 on dbstore2001 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 444.10 seconds
[13:26:27] Urbanecm: ok, deploying, this should be easy to revert if anything goes wrong
[13:26:30] ah > Loja da Wikipédia
[13:26:32] is there
[13:26:45] thcipriani|afk, but there should be two Loja da Wikipédia.
[13:26:53] One from configuration, one from MediaWiki:Sidebar [13:27:00] thcipriani|afk: but in the second section, looks like that one is hard-coded [13:27:25] https://pt.wikipedia.org/w/index.php?title=MediaWiki:Sidebar&diff=49638049&oldid=23888656 [13:27:46] Urbanecm: but ^ says "wikipedia store" not "loja de wikipedia" [13:28:15] zeljkof, that's correct, see the next diff https://pt.wikipedia.org/w/index.php?title=MediaWiki:Sidebar&diff=next&oldid=49638049 [13:28:28] Urbanecm: ah, ok [13:28:41] anyway, deploying, let's see if we can break wikipedia with this ;) [13:29:03] Ok [13:29:45] !log zfilipin@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:372874|Set $wmgUseWikimediaShopLink to true for ptwiki (T173768)]] (duration: 00m 45s) [13:29:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:29:57] T173768: Set $wmgUseWikimediaShopLink for ptwiki - https://phabricator.wikimedia.org/T173768 [13:30:16] Urbanecm: deployed, take another look at ptwiki now [13:30:20] o.O [13:30:26] It's there! [13:30:37] PROBLEM - Check size of conntrack table on ms-fe1005 is CRITICAL: CRITICAL: nf_conntrack is 99 % full [13:30:43] Urbanecm, zeljkof: i see it too [13:30:47] Urbanecm: ok, that is strange, but hey, it worked :) [13:30:49] Thank you for your deploy! [13:30:58] RECOVERY - MariaDB Slave Lag: s1 on dbstore2001 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [13:31:10] I'm taking a look at that conntrack for ms-fe1005 [13:31:15] Urbanecm: thanks for deploying with #releng ;) maybe there is something with sidebar and mwdebug... [13:31:30] phuedx: sorry, ptwiki problems, on your commit now [13:31:38] no worries [13:31:44] do 'em right [13:31:46] not quick [13:31:54] (preferably both though, obvs) [13:32:19] I'm built for comfort, not speed :'D [13:33:15] phuedx: any order files should be deployed, common then initialise? vice-versa? 
[13:33:37] RECOVERY - Check size of conntrack table on ms-fe1005 is OK: OK: nf_conntrack is 77 % full [13:34:00] 10Operations, 10ops-eqiad, 10Discovery-Search (Current work): Degraded RAID on logstash1006 - https://phabricator.wikimedia.org/T173679#3541107 (10Gehel) [13:34:01] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373043 (https://phabricator.wikimedia.org/T165991) (owner: 10Phuedx) [13:34:33] zeljkof: commonsettings first as it references a variable [13:34:45] 10Operations, 10ops-eqiad, 10Wikimedia-Logstash, 10Discovery-Search (Current work): Failed disk on logstash1006 - https://phabricator.wikimedia.org/T173689#3541109 (10Gehel) [13:34:47] 10Operations, 10ops-eqiad, 10Discovery-Search (Current work): Degraded RAID on logstash1006 - https://phabricator.wikimedia.org/T173679#3536569 (10Gehel) [13:34:58] (03PS2) 10Muehlenhoff: Sort output of programs needing a restart [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/373048 [13:35:20] which is removed by initialisesettings.php [13:35:41] phuedx: looked like that to me too, but I'm trying not to be smart during deploys (like I'm trying it the rest of the time) ;) [13:36:57] (03CR) 10Addshore: "Thanks for this!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/372761 (https://phabricator.wikimedia.org/T173571) (owner: 10Alex Monk) [13:37:01] (03PS2) 10Zfilipin: relatedArticles: Tidy up config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373043 (https://phabricator.wikimedia.org/T165991) (owner: 10Phuedx) [13:37:03] argh, gerrit and "cannot merge", there is something that does trivial rebases, but looks like it does not work most of the time :| [13:37:03] (03PS2) 10Filippo Giunchedi: prometheus: add blackbox configuration for prometheus::ops [puppet] - 10https://gerrit.wikimedia.org/r/373062 (https://phabricator.wikimedia.org/T169860) [13:37:07] PROBLEM - mobileapps endpoints health on scb2006 is CRITICAL: /{domain}/v1/page/featured/{yyyy}/{mm}/{dd} (retrieve title of the featured article for April 29, 2016) is CRITICAL: Test retrieve title of the featured article for April 29, 2016 returned the unexpected status 400 (expecting: 200) [13:37:09] (03PS3) 10Volans: Transports: improve target management [software/cumin] - 10https://gerrit.wikimedia.org/r/367825 (https://phabricator.wikimedia.org/T171684) [13:37:11] (03PS1) 10Volans: Fix test dependency issue [software/cumin] - 10https://gerrit.wikimedia.org/r/373068 [13:37:43] (03CR) 10Muehlenhoff: [C: 032] Sort output of programs needing a restart [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/373048 (owner: 10Muehlenhoff) [13:38:09] RECOVERY - mobileapps endpoints health on scb2006 is OK: All endpoints are healthy [13:38:22] (03PS1) 10KartikMistry: Enable Flow as a Beta feature on wawiki and wawikionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373069 (https://phabricator.wikimedia.org/T172947) [13:38:58] (03CR) 10jenkins-bot: relatedArticles: Tidy up config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373043 (https://phabricator.wikimedia.org/T165991) (owner: 10Phuedx) [13:39:43] (03CR) 10jerkins-bot: [V: 04-1] Enable Flow as a Beta feature on wawiki and wawikionary [mediawiki-config] 
- 10https://gerrit.wikimedia.org/r/373069 (https://phabricator.wikimedia.org/T172947) (owner: 10KartikMistry) [13:39:50] phuedx: the commit is at mwdebug1002, let me know if it looks good, so I can continue [13:39:59] zeljkof: awesome, thanks [13:40:01] testing now [13:40:18] took longer to merge because gerrit hates me :P [13:40:38] (03CR) 10Volans: "Addressed comments" (032 comments) [software/cumin] - 10https://gerrit.wikimedia.org/r/367825 (https://phabricator.wikimedia.org/T171684) (owner: 10Volans) [13:40:52] (03CR) 10Volans: [C: 032] Fix test dependency issue [software/cumin] - 10https://gerrit.wikimedia.org/r/373068 (owner: 10Volans) [13:43:41] zeljkof: i see related articles on wikivoyage and enwiki [13:43:42] lgtm [13:43:50] phuedx: ok to deploy? [13:44:00] yessir! [13:44:08] deploying [13:44:08] (03CR) 10Gehel: [C: 031] "LGTM" [software/cumin] - 10https://gerrit.wikimedia.org/r/367825 (https://phabricator.wikimedia.org/T171684) (owner: 10Volans) [13:44:19] (03PS2) 10KartikMistry: Enable Flow as a Beta feature on wawiki and wawikionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373069 (https://phabricator.wikimedia.org/T172947) [13:44:47] !log zfilipin@tin Synchronized wmf-config/CommonSettings.php: SWAT: [[gerrit:373043|relatedArticles: Tidy up config (T165991)]] (duration: 00m 44s) [13:44:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:44:59] T165991: Remove sidebar code from RelatedArticles extension - https://phabricator.wikimedia.org/T165991 [13:45:42] !log zfilipin@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:373043|relatedArticles: Tidy up config (T165991)]] (duration: 00m 44s) [13:45:53] phuedx: all deployed, please check [13:45:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:59] zeljkof: thanks! on it! 
[13:46:32] (03CR) 10Volans: [C: 032] Transports: improve target management [software/cumin] - 10https://gerrit.wikimedia.org/r/367825 (https://phabricator.wikimedia.org/T171684) (owner: 10Volans) [13:46:57] PROBLEM - configured eth on elastic1029 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:47:36] (03CR) 10Alexandros Kosiaris: [C: 032] apertium: Added apertium bel-rus packages and sorted list [puppet] - 10https://gerrit.wikimedia.org/r/373065 (https://phabricator.wikimedia.org/T172381) (owner: 10KartikMistry) [13:47:52] zeljkof: lgtm -- i'll keep an eye on it for the next hour [13:48:03] but right now it looks like it should: a NOP [13:48:10] phuedx: thanks for releasing with #releng! :D [13:48:16] (03Merged) 10jenkins-bot: Transports: improve target management [software/cumin] - 10https://gerrit.wikimedia.org/r/367825 (https://phabricator.wikimedia.org/T171684) (owner: 10Volans) [13:48:17] !log EU SWAT finished [13:48:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:31] logs look fine to me too [13:48:39] PROBLEM - Host elastic1029 is DOWN: PING CRITICAL - Packet loss = 100% [13:48:54] (not that I am an expert in logs...) [13:50:15] gehel: ^ [13:50:34] akosiaris: thx, yep, looking [13:51:17] PROBLEM - Check Varnish expiry mailbox lag on cp1049 is CRITICAL: CRITICAL: expiry mailbox lag is 2039472 [13:51:55] "no carrier" on eth0 on elastic1029... I guess that calls for cmjohnson1 ... [13:52:18] PROBLEM - mobileapps endpoints health on scb2006 is CRITICAL: /{domain}/v1/page/mobile-sections/{title} (retrieve en.wp main page via mobile-sections) is CRITICAL: Test retrieve en.wp main page via mobile-sections returned the unexpected status 503 (expecting: 200) [13:53:37] RECOVERY - Host elastic1029 is UP: PING OK - Packet loss = 0%, RTA = 0.59 ms [13:53:48] gehel: cable was loose [13:53:57] RECOVERY - configured eth on elastic1029 is OK: OK - interfaces up [13:53:59] cmjohnson1: that was fast! Thanks! 
[13:54:09] i may have done something when i was racking kafka's [13:54:27] RECOVERY - mobileapps endpoints health on scb2006 is OK: All endpoints are healthy [13:55:02] cmjohnson1: no problem, a few lost requests and some shards moving around... [13:55:37] PROBLEM - Check Varnish expiry mailbox lag on cp1074 is CRITICAL: CRITICAL: expiry mailbox lag is 2043697 [13:58:26] !log bounce varnish on cp1074 / cp1049 / cp1073 - mailbox problems [13:58:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:17] RECOVERY - Check Varnish expiry mailbox lag on cp1049 is OK: OK: expiry mailbox lag is 0 [14:03:27] PROBLEM - MariaDB Slave Lag: s1 on dbstore2001 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 310.21 seconds [14:05:29] (03PS1) 10Muehlenhoff: Hide Cumin output in restarts detection [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/373075 [14:05:37] RECOVERY - Check Varnish expiry mailbox lag on cp1074 is OK: OK: expiry mailbox lag is 0 [14:11:37] (03CR) 10Muehlenhoff: [C: 032] Hide Cumin output in restarts detection [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/373075 (owner: 10Muehlenhoff) [14:14:24] (03CR) 10Volans: "See inline" (038 comments) [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/373075 (owner: 10Muehlenhoff) [14:15:02] (03PS8) 10Rush: openstack: keystone as module/profile/role for deployments [puppet] - 10https://gerrit.wikimedia.org/r/370288 (https://phabricator.wikimedia.org/T171494) [14:15:29] (03CR) 10jerkins-bot: [V: 04-1] openstack: keystone as module/profile/role for deployments [puppet] - 10https://gerrit.wikimedia.org/r/370288 (https://phabricator.wikimedia.org/T171494) (owner: 10Rush) [14:16:09] (03CR) 10Urbanecm: [C: 031] Enable Flow as a Beta feature on wawiki and wawikionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373069 (https://phabricator.wikimedia.org/T172947) (owner: 10KartikMistry) [14:16:55] (03PS9) 10Rush: openstack: keystone as module/profile/role for deployments 
[puppet] - 10https://gerrit.wikimedia.org/r/370288 (https://phabricator.wikimedia.org/T171494) [14:17:37] PROBLEM - MariaDB Slave Lag: s1 on dbstore2001 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 328.39 seconds [14:18:27] PROBLEM - puppet last run on restbase-dev1006 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[cassandra] [14:19:16] (03PS1) 10Muehlenhoff: Migrate Cumin interface into a separate function [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/373079 [14:19:47] (03CR) 10Muehlenhoff: [C: 032] Migrate Cumin interface into a separate function [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/373079 (owner: 10Muehlenhoff) [14:22:39] (03PS1) 10Andrew Bogott: labs instances: switch salt-master to labcontrol1001 [puppet] - 10https://gerrit.wikimedia.org/r/373080 (https://phabricator.wikimedia.org/T171786) [14:22:41] (03PS1) 10Andrew Bogott: remove role::labs::puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/373081 (https://phabricator.wikimedia.org/T171786) [14:23:25] 10Operations, 10Cloud-VPS, 10Patch-For-Review, 10cloud-services-team (Kanban): Switch to new labs puppetmasters - https://phabricator.wikimedia.org/T171786#3541284 (10Andrew) I'm going to leave labcontrol1001 as the salt master. No sense in rebuilding this when we're going to stop using salt soon, and the... [14:23:28] 10Operations, 10MediaWiki-extensions-Scribunto: Build and push a new hhvm-luasandbox package - https://phabricator.wikimedia.org/T171166#3541286 (10Anomie) >>! In T171166#3540325, @MoritzMuehlenhoff wrote: > The migrated canary servers are looking fine. I was initially irritated by https://phabricator.wikimedi... 
[14:23:40] (03CR) 10Andrew Bogott: [C: 032] labs instances: switch salt-master to labcontrol1001 [puppet] - 10https://gerrit.wikimedia.org/r/373080 (https://phabricator.wikimedia.org/T171786) (owner: 10Andrew Bogott) [14:25:06] (03PS2) 10Andrew Bogott: remove role::labs::puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/373081 (https://phabricator.wikimedia.org/T171786) [14:25:07] PROBLEM - puppet last run on restbase-dev1005 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[cassandra] [14:26:35] (03PS1) 10Muehlenhoff: Add changes suggested by volans in earlier review [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/373082 [14:27:07] PROBLEM - puppet last run on restbase-dev1004 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[cassandra] [14:29:08] PROBLEM - mobileapps endpoints health on scb2006 is CRITICAL: /{domain}/v1/page/mobile-sections-lead/{title} (retrieve lead section of en.wp Altrincham page via mobile-sections-lead) is CRITICAL: Test retrieve lead section of en.wp Altrincham page via mobile-sections-lead returned the unexpected status 503 (expecting: 200) [14:31:18] RECOVERY - mobileapps endpoints health on scb2006 is OK: All endpoints are healthy [14:31:54] (03CR) 10Volans: [C: 031] "LGTM, suggestion for a further improvement inline." 
(031 comment) [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/373082 (owner: 10Muehlenhoff) [14:32:10] (03PS10) 10Rush: openstack: keystone as module/profile/role for deployments [puppet] - 10https://gerrit.wikimedia.org/r/370288 (https://phabricator.wikimedia.org/T171494) [14:33:54] (03CR) 10Muehlenhoff: [C: 032] Add changes suggested by volans in earlier review [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/373082 (owner: 10Muehlenhoff) [14:34:17] PROBLEM - mobileapps endpoints health on scb2006 is CRITICAL: /{domain}/v1/feed/onthisday/{type}/{mm}/{dd} (retrieve the selected anniversaries for January 15) is CRITICAL: Test retrieve the selected anniversaries for January 15 returned the unexpected status 400 (expecting: 200) [14:35:27] RECOVERY - mobileapps endpoints health on scb2006 is OK: All endpoints are healthy [14:37:41] (03CR) 10Herron: "Could we incorporate the default bounce/warn text [1] while also adding the References header?" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/372848 (https://phabricator.wikimedia.org/T173733) (owner: 10Alexandros Kosiaris) [14:38:06] 10Operations, 10DBA, 10Wikimedia-Site-requests, 10User-MarcoAurelio: Global rename: Opdire657 → Sakiv; supervision needed - https://phabricator.wikimedia.org/T173834#3541365 (10MarcoAurelio) [14:48:47] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 26 probes of 269 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [14:48:58] RECOVERY - MariaDB Slave Lag: s1 on dbstore2001 is OK: OK slave_sql_lag Replication lag: 49.12 seconds [14:52:23] 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-Cluster, and 2 others: rack/setup/install new kafka nodes kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T167992#3541450 (10Cmjohnson) [14:53:47] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 8 probes of 269 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [14:56:54] 10Operations, 
10ops-eqiad, 10Analytics, 10Analytics-Cluster, 10User-Elukey: kafka-jumbo1004 h/w problem most likely raid card - https://phabricator.wikimedia.org/T173837#3541461 (10Cmjohnson) [15:01:22] (03PS1) 10BBlack: browsersec: add pt, bn, ru, sv, he, sq [puppet] - 10https://gerrit.wikimedia.org/r/373086 (https://phabricator.wikimedia.org/T163251) [15:01:25] (03PS1) 10BBlack: Varnish: move errorpage/browsersec to common code [puppet] - 10https://gerrit.wikimedia.org/r/373087 (https://phabricator.wikimedia.org/T163251) [15:01:27] (03PS1) 10BBlack: browsersec: remove wiki-colon filtering [puppet] - 10https://gerrit.wikimedia.org/r/373088 (https://phabricator.wikimedia.org/T163251) [15:02:29] (03CR) 10Alexandros Kosiaris: "That's a fair question. I 've been thinking about it as well and are somewhat ambivalent, here's my thoughts" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/372848 (https://phabricator.wikimedia.org/T173733) (owner: 10Alexandros Kosiaris) [15:02:55] (03CR) 10Rush: [C: 032] openstack: keystone as module/profile/role for deployments [puppet] - 10https://gerrit.wikimedia.org/r/370288 (https://phabricator.wikimedia.org/T171494) (owner: 10Rush) [15:03:29] (03PS2) 10Alexandros Kosiaris: mail::mx: Ship bounce/warn message files [puppet] - 10https://gerrit.wikimedia.org/r/372848 (https://phabricator.wikimedia.org/T173733) [15:06:38] PROBLEM - puppet last run on labtestcontrol2001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [15:10:49] (03PS1) 10Rush: openstack: remove old keystone from nova::controller role [puppet] - 10https://gerrit.wikimedia.org/r/373089 (https://phabricator.wikimedia.org/T171494) [15:11:38] (03CR) 10Rush: [C: 032] openstack: remove old keystone from nova::controller role [puppet] - 10https://gerrit.wikimedia.org/r/373089 (https://phabricator.wikimedia.org/T171494) (owner: 10Rush) [15:13:47] RECOVERY - puppet last run on labtestcontrol2001 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [15:16:14] 10Operations, 10MediaWiki-JobQueue: Job queue is increasing non-stop - https://phabricator.wikimedia.org/T173710#3541596 (10Jdforrester-WMF) From the chart, the big drop in errors happened on 2017-08-14 (and was caused by the closing of wikis which then had outstanding jobs that could never complete, AIUI); th... [15:16:48] 10Operations, 10MediaWiki-JobQueue, 10Performance-Team: Job queue is increasing non-stop - https://phabricator.wikimedia.org/T173710#3541608 (10Jdforrester-WMF) [15:17:54] (03PS2) 10BBlack: browsersec: add pt, bn, ru, sv, he, sq [puppet] - 10https://gerrit.wikimedia.org/r/373086 (https://phabricator.wikimedia.org/T163251) [15:17:55] (03PS2) 10BBlack: Varnish: move errorpage/browsersec to common code [puppet] - 10https://gerrit.wikimedia.org/r/373087 (https://phabricator.wikimedia.org/T163251) [15:17:57] (03PS2) 10BBlack: browsersec: remove wiki-colon filtering [puppet] - 10https://gerrit.wikimedia.org/r/373088 (https://phabricator.wikimedia.org/T163251) [15:19:01] (03CR) 10BBlack: [C: 032] browsersec: add pt, bn, ru, sv, he, sq [puppet] - 10https://gerrit.wikimedia.org/r/373086 (https://phabricator.wikimedia.org/T163251) (owner: 10BBlack) [15:20:16] (03PS1) 10Rush: openstack: update keystone template [puppet] - 10https://gerrit.wikimedia.org/r/373090 (https://phabricator.wikimedia.org/T171494) [15:20:43] (03PS2) 10Rush: openstack: update keystone template [puppet] - 
10https://gerrit.wikimedia.org/r/373090 (https://phabricator.wikimedia.org/T171494) [15:21:04] (03CR) 10Rush: [V: 032 C: 032] openstack: update keystone template [puppet] - 10https://gerrit.wikimedia.org/r/373090 (https://phabricator.wikimedia.org/T171494) (owner: 10Rush) [15:37:09] (03PS3) 10BBlack: Varnish: move errorpage/browsersec to common code [puppet] - 10https://gerrit.wikimedia.org/r/373087 (https://phabricator.wikimedia.org/T163251) [15:37:11] (03PS3) 10BBlack: browsersec: remove wiki-colon filtering [puppet] - 10https://gerrit.wikimedia.org/r/373088 (https://phabricator.wikimedia.org/T163251) [15:37:46] (03CR) 10BBlack: [C: 032] Varnish: move errorpage/browsersec to common code [puppet] - 10https://gerrit.wikimedia.org/r/373087 (https://phabricator.wikimedia.org/T163251) (owner: 10BBlack) [15:37:54] (03CR) 10Ayounsi: [C: 031] prometheus: add blackbox configuration for prometheus::ops [puppet] - 10https://gerrit.wikimedia.org/r/373062 (https://phabricator.wikimedia.org/T169860) (owner: 10Filippo Giunchedi) [15:37:56] (03CR) 10BBlack: [C: 032] browsersec: remove wiki-colon filtering [puppet] - 10https://gerrit.wikimedia.org/r/373088 (https://phabricator.wikimedia.org/T163251) (owner: 10BBlack) [15:38:34] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): labvirt1015 crashes - https://phabricator.wikimedia.org/T171473#3541723 (10Andrew) For example: ``` CPU 0 BANK 18 TSC b270de9d648a RIP !INEXACT! 10:ffffffff8146c0e8 MISC c0fe2010821cc086 ADDR 3f62282b00 TIME 1502812434 Tue Aug 15 15:5... [15:41:48] (03PS9) 10Gehel: logrotate - introduce a generic logrotate template [puppet] - 10https://gerrit.wikimedia.org/r/342228 [15:48:08] (03PS2) 10Gehel: eventlogging - use the new logrotate::rule [puppet] - 10https://gerrit.wikimedia.org/r/342229 [15:49:39] (03CR) 10Volans: [C: 031] "LGTM, minor nitpicking inline." 
(032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/342228 (owner: 10Gehel) [15:54:09] (03CR) 10Volans: [C: 031] "Potential improvement, see inline." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/342228 (owner: 10Gehel) [15:55:49] (03CR) 10Ottomata: [C: 031] "+1, as long as we are sure that template isn't used elsewhere :)" [puppet] - 10https://gerrit.wikimedia.org/r/342229 (owner: 10Gehel) [15:56:30] 10Operations, 10ops-codfw, 10DC-Ops, 10Data-Services: Split up labstore external shelf storage available in codfw between labstore2001 and 2 - https://phabricator.wikimedia.org/T171623#3541756 (10Papaul) a:05Papaul>03madhuvishy @Cmjohnson I have no issues @madhuvishy This is complete please check and... [15:59:17] (03PS10) 10Gehel: logrotate - introduce a generic logrotate template [puppet] - 10https://gerrit.wikimedia.org/r/342228 [15:59:39] (03CR) 10Gehel: logrotate - introduce a generic logrotate template (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/342228 (owner: 10Gehel) [15:59:45] (03CR) 10jerkins-bot: [V: 04-1] logrotate - introduce a generic logrotate template [puppet] - 10https://gerrit.wikimedia.org/r/342228 (owner: 10Gehel) [16:00:05] godog, moritzm, and _joe_: Respected human, time to deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170822T1600). Please do the needful. [16:00:30] no patches [16:00:48] (03PS11) 10Gehel: logrotate - introduce a generic logrotate template [puppet] - 10https://gerrit.wikimedia.org/r/342228 [16:04:25] (03PS3) 10Gehel: eventlogging - use the new logrotate::rule [puppet] - 10https://gerrit.wikimedia.org/r/342229 [16:07:23] 10Operations, 10ops-codfw: failing RAID disk on frdb2001 - https://phabricator.wikimedia.org/T171584#3541768 (10Papaul) p:05Normal>03Lowest [16:12:10] (03CR) 10Volans: [C: 031] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/342228 (owner: 10Gehel) [16:12:12] 10Operations, 10Wikimedia-Logstash, 10Patch-For-Review: Move logstash ingestion behind LVS - https://phabricator.wikimedia.org/T151971#3541777 (10fgiunchedi) [16:12:14] 10Operations, 10Wikimedia-Logstash, 10Patch-For-Review: fix partition scheme for logstash ingester hosts - https://phabricator.wikimedia.org/T150108#3541774 (10fgiunchedi) 05Open>03Resolved a:03fgiunchedi Resolving, logstash ingestion is moving to ganeti [16:13:46] (03CR) 10Krinkle: [C: 04-1] webperf: Convert navtiming.py to use KafkaConsumer (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/372483 (https://phabricator.wikimedia.org/T110903) (owner: 10Krinkle) [16:14:01] (03PS4) 10Gehel: eventlogging - use the new logrotate::rule [puppet] - 10https://gerrit.wikimedia.org/r/342229 [16:16:23] (03CR) 10Gehel: "Puppet compiler looks happy: https://puppet-compiler.wmflabs.org/compiler02/7562/eventlog1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/342229 (owner: 10Gehel) [16:17:25] (03PS1) 10BBlack: browsersec: add back wiki-colon filtering for text only [puppet] - 10https://gerrit.wikimedia.org/r/373099 (https://phabricator.wikimedia.org/T163251) [16:19:33] (03PS12) 10Gehel: logrotate - introduce a generic logrotate template [puppet] - 10https://gerrit.wikimedia.org/r/342228 [16:21:09] (03CR) 10Gehel: [C: 032] logrotate - introduce a generic logrotate template [puppet] - 10https://gerrit.wikimedia.org/r/342228 (owner: 10Gehel) [16:21:55] (03PS5) 10Gehel: eventlogging - use the new logrotate::rule [puppet] - 10https://gerrit.wikimedia.org/r/342229 [16:22:38] 10Operations, 10ops-codfw: failing RAID disk on frdb2001 - https://phabricator.wikimedia.org/T171584#3541817 (10RobH) 05Open>03Resolved [16:23:25] (03PS2) 10BBlack: browsersec: add back wiki-colon filtering for text only [puppet] - 10https://gerrit.wikimedia.org/r/373099 (https://phabricator.wikimedia.org/T163251) [16:24:48] (03CR) 10Gehel: [C: 032] 
eventlogging - use the new logrotate::rule [puppet] - 10https://gerrit.wikimedia.org/r/342229 (owner: 10Gehel) [16:25:58] (03PS3) 10BBlack: browsersec: add back wiki-colon filtering for text only [puppet] - 10https://gerrit.wikimedia.org/r/373099 (https://phabricator.wikimedia.org/T163251) [16:26:15] 10Operations, 10ops-eqiad, 10Discovery-Search (Current work): Degraded RAID on logstash1006 - https://phabricator.wikimedia.org/T173679#3541835 (10Cmjohnson) A self dispatch has been ordered with Dell. Work Order: SR952745470 [16:26:20] (03CR) 10Krinkle: "@Ottomata This is probably the most Python code I've written in one day so far :) - I learned a lot about the different data types, arg sp" [puppet] - 10https://gerrit.wikimedia.org/r/372577 (https://phabricator.wikimedia.org/T104902) (owner: 10Krinkle) [16:28:16] (03CR) 10BBlack: [C: 032] browsersec: add back wiki-colon filtering for text only [puppet] - 10https://gerrit.wikimedia.org/r/373099 (https://phabricator.wikimedia.org/T163251) (owner: 10BBlack) [16:37:44] 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-Cluster, 10User-Elukey: kafka-jumbo1004 h/w problem most likely raid card - https://phabricator.wikimedia.org/T173837#3541879 (10Cmjohnson) open a self dispatch with Dell for a new raid card [16:45:51] (03CR) 10Krinkle: [C: 04-1] webperf: Convert navtiming.py to use KafkaConsumer (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/372483 (https://phabricator.wikimedia.org/T110903) (owner: 10Krinkle) [16:46:04] (03PS3) 10Krinkle: webperf: Convert navtiming.py to use KafkaConsumer [puppet] - 10https://gerrit.wikimedia.org/r/372483 (https://phabricator.wikimedia.org/T110903) [16:52:43] jdlrobson: bmansurov: I was going to SWAT https://gerrit.wikimedia.org/r/#/c/372593/ for you, but I don't think I'll make it today. Could one of you instead? 
[16:58:02] (03PS4) 10Gehel: wdqs - send logs to logstash [puppet] - 10https://gerrit.wikimedia.org/r/371939 (https://phabricator.wikimedia.org/T172710) [16:58:23] (03PS5) 10Gehel: wdqs - send logs to logstash [puppet] - 10https://gerrit.wikimedia.org/r/371939 (https://phabricator.wikimedia.org/T172710) [16:59:15] 10Operations, 10ops-eqiad, 10Cloud-Services: rack/setup/install labmon1002 - https://phabricator.wikimedia.org/T165784#3541951 (10RobH) [17:00:05] gwicke, cscott, arlolra, subbu, halfak, and Amir1: Respected human, time to deploy Services – Graphoid / Parsoid / OCG / Citoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170822T1700). Please do the needful. [17:00:15] Nothing for ORES today [17:01:35] (03CR) 10Muehlenhoff: Run Lilypond from Firejail (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/370358 (https://phabricator.wikimedia.org/T172582) (owner: 10Ebe123) [17:02:10] Krinkle, ok, i'll do it [17:02:16] Thanks! [17:02:46] Krinkle, np, how would I verify the fix after it's deployed? [17:03:27] bmansurov: Verify that, now, Special:Upload and Special:Notifications on enwiki have no icons on the top right for "Alerts" and "Notices", but that they do on any other page, and after the fix is applied, they do have icons, [17:03:39] on those pages. 
[17:03:43] great, thanks [17:07:37] !log starting branch cut for 1.30.0-wmf.15 [17:07:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:12:26] (03PS1) 10Reedy: test [puppet] - 10https://gerrit.wikimedia.org/r/373107 [17:12:30] robh: ^ wfm [17:12:45] bleh [17:15:03] (03PS1) 10RobH: setting labmon1002 install params [puppet] - 10https://gerrit.wikimedia.org/r/373108 (https://phabricator.wikimedia.org/T165784) [17:16:41] (03CR) 10RobH: [C: 032] setting labmon1002 install params [puppet] - 10https://gerrit.wikimedia.org/r/373108 (https://phabricator.wikimedia.org/T165784) (owner: 10RobH) [17:24:56] ACKNOWLEDGEMENT - Check systemd state on logstash1006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. Gehel failed disk - https://phabricator.wikimedia.org/T173679 [17:24:56] ACKNOWLEDGEMENT - ElasticSearch health check for shards on logstash1006 is CRITICAL: CRITICAL - elasticsearch http://10.64.48.109:9200/_cluster/health error while fetching: HTTPConnectionPool(host=10.64.48.109, port=9200): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(requests.packages.urllib3.connection.HTTPConnection object at 0x7fd66b1deb90: Failed to establish a new connection: [Errno 111] Co [17:24:56] Gehel failed disk - https://phabricator.wikimedia.org/T173679 [17:29:10] (03PS3) 10Andrew Bogott: remove role::labs::puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/373081 (https://phabricator.wikimedia.org/T171786) [17:29:52] (03CR) 10Andrew Bogott: [C: 032] remove role::labs::puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/373081 (https://phabricator.wikimedia.org/T171786) (owner: 10Andrew Bogott) [17:32:58] 10Operations, 10Discovery, 10Elasticsearch, 10Discovery-Search (Current work), 10Patch-For-Review: Use SSL certificates with discovery entry for elasticsearch - https://phabricator.wikimedia.org/T162037#3150511 (10debt) This isn't urgent quite yet, but we're doing work 
around this issue that might help o... [17:37:13] 10Operations, 10ops-codfw: mw2256 - hardware issue - https://phabricator.wikimedia.org/T163346#3542123 (10Papaul) a:05Papaul>03elukey @elukey main board replacement complete. [17:42:52] PROBLEM - puppetmaster https on labtestcontrol2001 is CRITICAL: connect to address 208.80.153.47 and port 8140: Connection refused [17:52:20] (03PS1) 10Andrew Bogott: remove cnames for old labs puppetmasters [dns] - 10https://gerrit.wikimedia.org/r/373113 (https://phabricator.wikimedia.org/T171786) [17:53:52] jynus: ping? need some help with mysql queries [17:54:20] !log removing obsolete apache2 and puppetmaster packages from labcontrol boxes for https://phabricator.wikimedia.org/T171786 [17:54:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:58:38] (03CR) 10Smalyshev: wdqs - send logs to logstash (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/371939 (https://phabricator.wikimedia.org/T172710) (owner: 10Gehel) [18:00:52] 10Operations, 10Cloud-VPS, 10Patch-For-Review, 10cloud-services-team (Kanban): Switch to new labs puppetmasters - https://phabricator.wikimedia.org/T171786#3542228 (10Andrew) I merged the patch removing puppetmaster from labcontrols. Then, on labcontrol1001, 1002, and labtestcontrol1001 I did the followin... 
[18:02:45] (03PS1) 10Herron: WIP: Add shiladsen shell account [puppet] - 10https://gerrit.wikimedia.org/r/373115 [18:02:48] 10Operations, 10Cloud-VPS, 10Patch-For-Review, 10cloud-services-team (Kanban): Switch to new labs puppetmasters - https://phabricator.wikimedia.org/T171786#3542236 (10Andrew) [18:08:10] (03PS1) 10ArielGlenn: start of setup of dumpsdata hosts [puppet] - 10https://gerrit.wikimedia.org/r/373117 (https://phabricator.wikimedia.org/T169849) [18:08:33] (03CR) 10Gehel: wdqs - send logs to logstash (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/371939 (https://phabricator.wikimedia.org/T172710) (owner: 10Gehel) [18:08:36] (03CR) 10jerkins-bot: [V: 04-1] start of setup of dumpsdata hosts [puppet] - 10https://gerrit.wikimedia.org/r/373117 (https://phabricator.wikimedia.org/T169849) (owner: 10ArielGlenn) [18:13:21] (03PS2) 10ArielGlenn: start of setup of dumpsdata hosts [puppet] - 10https://gerrit.wikimedia.org/r/373117 (https://phabricator.wikimedia.org/T169849) [18:27:20] marostegui/etc. in case someone has time for a global rename. 
[18:27:28] (will also file a ticket for tracking) [18:31:46] 10Operations, 10DBA, 10Wikimedia-Site-requests: Papa1234 → Karl-Heinz JansenPapa1234 → Karl-Heinz Jansen - https://phabricator.wikimedia.org/T173859#3542365 (10Steinsplitter) [18:36:31] 10Operations, 10DBA, 10Wikimedia-Site-requests: Global rename supervision request: Papa1234 → Karl-Heinz Jansen - https://phabricator.wikimedia.org/T173859#3542408 (10Steinsplitter) [18:40:48] !log firmware update of labservices1002 in progress [18:41:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:46:40] (03PS1) 10Andrew Bogott: shinkengen: pull in keystone and puppet hosts from hiera [puppet] - 10https://gerrit.wikimedia.org/r/373122 [18:47:47] (03CR) 10Andrew Bogott: [C: 032] shinkengen: pull in keystone and puppet hosts from hiera [puppet] - 10https://gerrit.wikimedia.org/r/373122 (owner: 10Andrew Bogott) [18:50:01] !log labservies1003 update completed [18:50:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:50:16] !log labservies1002 update completed (not 1003, typo) [18:50:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:50:44] 10Operations, 10Patch-For-Review: CPU throttling on DELL PowerEdge R320 - https://phabricator.wikimedia.org/T162850#3542481 (10RobH) [18:51:46] (03PS1) 10Andrew Bogott: define labs_keystone_host on labs instances [puppet] - 10https://gerrit.wikimedia.org/r/373123 [18:53:32] (03CR) 10Andrew Bogott: [C: 032] define labs_keystone_host on labs instances [puppet] - 10https://gerrit.wikimedia.org/r/373123 (owner: 10Andrew Bogott) [18:56:09] (03PS1) 10Andrew Bogott: add keystone_config:public_port to labs instance hiera [puppet] - 10https://gerrit.wikimedia.org/r/373124 [18:56:49] (03CR) 10Andrew Bogott: [C: 032] add keystone_config:public_port to labs instance hiera [puppet] - 10https://gerrit.wikimedia.org/r/373124 (owner: 10Andrew Bogott) [18:57:14] 10Operations, 10ops-eqiad, 10Cloud-Services, 
10Patch-For-Review: rack/setup/install labmon1002 - https://phabricator.wikimedia.org/T165784#3542524 (10RobH) [18:58:43] (03PS6) 10Gehel: wdqs - send logs to logstash [puppet] - 10https://gerrit.wikimedia.org/r/371939 (https://phabricator.wikimedia.org/T172710) [18:58:50] 10Operations, 10Toolforge, 10Toolforge-standards-committee, 10Traffic, 10HTTPS: Detect tools.wmflabs.org tools which are HTTP-only - https://phabricator.wikimedia.org/T128409#3542533 (10Quiddity) [18:59:08] (03CR) 10jerkins-bot: [V: 04-1] wdqs - send logs to logstash [puppet] - 10https://gerrit.wikimedia.org/r/371939 (https://phabricator.wikimedia.org/T172710) (owner: 10Gehel) [19:00:04] thcipriani: Dear anthropoid, the time has come. Please deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170822T1900). [19:00:05] (03PS7) 10Gehel: wdqs - send logs to logstash [puppet] - 10https://gerrit.wikimedia.org/r/371939 (https://phabricator.wikimedia.org/T172710) [19:00:10] * thcipriani does [19:00:31] (03CR) 10jerkins-bot: [V: 04-1] wdqs - send logs to logstash [puppet] - 10https://gerrit.wikimedia.org/r/371939 (https://phabricator.wikimedia.org/T172710) (owner: 10Gehel) [19:00:36] !log thcipriani@tin Started scap: testwiki to php-1.30.0-wmf.15 and rebuild l10n cache [19:00:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:01:02] (03PS1) 10Andrew Bogott: add explicit keystone_public_port to labs hiera [puppet] - 10https://gerrit.wikimedia.org/r/373125 [19:02:18] (03CR) 10Andrew Bogott: [C: 032] add explicit keystone_public_port to labs hiera [puppet] - 10https://gerrit.wikimedia.org/r/373125 (owner: 10Andrew Bogott) [19:06:37] (03PS1) 10RobH: setting labmon1002 to role spare [puppet] - 10https://gerrit.wikimedia.org/r/373128 (https://phabricator.wikimedia.org/T165784) [19:07:30] (03CR) 10RobH: [C: 032] setting labmon1002 to role spare [puppet] - 10https://gerrit.wikimedia.org/r/373128 (https://phabricator.wikimedia.org/T165784) 
(owner: 10RobH) [19:10:29] 10Operations, 10Discovery, 10Maps-Sprint, 10Maps (Kartographer), and 2 others: nodejs 6.11 - https://phabricator.wikimedia.org/T170548#3542612 (10debt) p:05Triage>03High The maps-test cluster is nearly done with being re-imaged and once it's fully rebuilt, we'll take a look at this. [19:10:55] (03PS29) 10Ppchelko: Increase max kafka message size for changeprop and kafka main [puppet] - 10https://gerrit.wikimedia.org/r/372179 [19:11:17] (03CR) 10Ppchelko: "@Ottomata Done!" [puppet] - 10https://gerrit.wikimedia.org/r/372179 (owner: 10Ppchelko) [19:11:46] (03CR) 10Andrew Bogott: [C: 032] remove cnames for old labs puppetmasters [dns] - 10https://gerrit.wikimedia.org/r/373113 (https://phabricator.wikimedia.org/T171786) (owner: 10Andrew Bogott) [19:12:12] (03CR) 10Ppchelko: "@Ottomata - and there will be larger messages produced - there're some job that are being produced now that are ~2 Megs." [puppet] - 10https://gerrit.wikimedia.org/r/372179 (owner: 10Ppchelko) [19:13:21] 10Operations, 10Cloud-VPS, 10Patch-For-Review, 10cloud-services-team (Kanban): Switch to new labs puppetmasters - https://phabricator.wikimedia.org/T171786#3542622 (10Andrew) [19:23:07] (03CR) 10RobH: [C: 031] "lgtm, dont forget to update the commit message to include bug: T171988" [puppet] - 10https://gerrit.wikimedia.org/r/373115 (owner: 10Herron) [19:23:14] (03CR) 10Smalyshev: wdqs - send logs to logstash (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/371939 (https://phabricator.wikimedia.org/T172710) (owner: 10Gehel) [19:24:12] (03PS1) 10Urbanecm: Enable wgEchoPerUserBlacklist at all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373133 (https://phabricator.wikimedia.org/T173838) [19:28:06] 10Operations, 10MediaWiki-JobQueue, 10Performance-Team: Job queue is increasing non-stop - https://phabricator.wikimedia.org/T173710#3542688 (10aaron) Mostly htmlCacheUpdate jobs on wikidatawiki: htmlCacheUpdate: 6014947 queued; 5 claimed (0 active, 5 
abandoned); 0 delayed [19:33:48] 10Operations, 10MediaWiki-Platform-Team, 10Performance-Team, 10HHVM: Convert Wikimedia production HHVM instances to have hhvm.php7.all set true - https://phabricator.wikimedia.org/T173786#3542733 (10Reedy) [19:34:20] (03PS1) 10Dzahn: icinga: add plugin to check for long running screens [puppet] - 10https://gerrit.wikimedia.org/r/373135 (https://phabricator.wikimedia.org/T165348) [19:40:50] (03PS2) 10Herron: Add shiladsen shell account [puppet] - 10https://gerrit.wikimedia.org/r/373115 (https://phabricator.wikimedia.org/T171988) [19:40:53] 10Operations, 10ops-eqiad, 10Cloud-Services, 10Patch-For-Review: rack/setup/install labmon1002 - https://phabricator.wikimedia.org/T165784#3542755 (10RobH) a:05RobH>03chasemp [19:41:29] 10Operations, 10Cloud-Services: rack/setup/install labmon1002 - https://phabricator.wikimedia.org/T165784#3276770 (10RobH) labmon1002 is now ready for cloud team implementation. I've assigned to @chasemp since he made the initial hardware request. [19:42:02] (03CR) 10Herron: [C: 032] Add shiladsen shell account [puppet] - 10https://gerrit.wikimedia.org/r/373115 (https://phabricator.wikimedia.org/T171988) (owner: 10Herron) [19:42:23] (03PS3) 10Herron: Add shiladsen shell account [puppet] - 10https://gerrit.wikimedia.org/r/373115 (https://phabricator.wikimedia.org/T171988) [19:45:32] 10Operations, 10Ops-Access-Requests: Requesting access to restricted hosts for dbarratt - https://phabricator.wikimedia.org/T173779#3539582 (10RobH) Please note that restricted is a sudo level group, and this request has to pass approval during the weekly (Monday) operations meetings. Additionally, @dbarratt... 
[19:46:32] !log thcipriani@tin Finished scap: testwiki to php-1.30.0-wmf.15 and rebuild l10n cache (duration: 45m 55s) [19:46:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:52:52] (03PS1) 10Thcipriani: Group0 to 1.30.0-wmf.15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373137 [19:52:59] PROBLEM - cassandra-a SSL 10.192.32.137:7001 on restbase2004 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [19:53:00] PROBLEM - cassandra-a CQL 10.192.32.137:9042 on restbase2004 is CRITICAL: connect to address 10.192.32.137 and port 9042: Connection refused [19:53:00] PROBLEM - cassandra-c CQL 10.192.48.51:9042 on restbase2006 is CRITICAL: connect to address 10.192.48.51 and port 9042: Connection refused [19:53:29] PROBLEM - cassandra-c SSL 10.192.48.51:7001 on restbase2006 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [19:53:57] (03CR) 10Thcipriani: [C: 032] Group0 to 1.30.0-wmf.15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373137 (owner: 10Thcipriani) [19:54:29] PROBLEM - Check systemd state on restbase2006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [19:54:30] PROBLEM - cassandra-a service on restbase2004 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [19:54:49] PROBLEM - Check systemd state on restbase2004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[19:55:09] PROBLEM - cassandra-c service on restbase2006 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed [19:55:25] (03Merged) 10jenkins-bot: Group0 to 1.30.0-wmf.15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373137 (owner: 10Thcipriani) [19:55:34] (03CR) 10jenkins-bot: Group0 to 1.30.0-wmf.15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373137 (owner: 10Thcipriani) [19:56:57] !log starting cassandra restbase2004-a and restbase2006-c, OOMs [19:57:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:57:30] RECOVERY - Check systemd state on restbase2006 is OK: OK - running: The system is fully operational [19:57:39] RECOVERY - cassandra-a service on restbase2004 is OK: OK - cassandra-a is active [19:57:50] RECOVERY - Check systemd state on restbase2004 is OK: OK - running: The system is fully operational [19:58:09] RECOVERY - cassandra-c service on restbase2006 is OK: OK - cassandra-c is active [19:59:00] RECOVERY - cassandra-a CQL 10.192.32.137:9042 on restbase2004 is OK: TCP OK - 0.037 second response time on 10.192.32.137 port 9042 [19:59:09] RECOVERY - cassandra-c CQL 10.192.48.51:9042 on restbase2006 is OK: TCP OK - 0.036 second response time on 10.192.48.51 port 9042 [19:59:10] RECOVERY - cassandra-a SSL 10.192.32.137:7001 on restbase2004 is OK: SSL OK - Certificate restbase2004-a valid until 2018-07-19 10:52:21 +0000 (expires in 330 days) [19:59:29] RECOVERY - cassandra-c SSL 10.192.48.51:7001 on restbase2006 is OK: SSL OK - Certificate restbase2006-c valid until 2018-07-19 10:52:32 +0000 (expires in 330 days) [20:00:09] !log thcipriani@tin rebuilt wikiversions.php and synchronized wikiversions files: group0 to 1.30.0-wmf.15 [20:00:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:03:23] 10Operations, 10Ops-Access-Requests: Requesting access to restricted hosts for dbarratt - https://phabricator.wikimedia.org/T173779#3542840 (10dbarratt) >>! 
In T173779#3542773, @RobH wrote: > Additionally, @dbarratt must read and sign the L3 document on phabricator. Done. [20:08:09] PROBLEM - cassandra-a CQL 10.192.32.137:9042 on restbase2004 is CRITICAL: connect to address 10.192.32.137 and port 9042: Connection refused [20:08:39] PROBLEM - cassandra-a SSL 10.192.32.137:7001 on restbase2004 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [20:10:49] RECOVERY - cassandra-a SSL 10.192.32.137:7001 on restbase2004 is OK: SSL OK - Certificate restbase2004-a valid until 2018-07-19 10:52:21 +0000 (expires in 330 days) [20:11:09] RECOVERY - cassandra-a CQL 10.192.32.137:9042 on restbase2004 is OK: TCP OK - 0.036 second response time on 10.192.32.137 port 9042 [20:11:30] PROBLEM - MariaDB Slave Lag: s3 on dbstore2001 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 307.67 seconds [20:13:30] PROBLEM - MariaDB Slave Lag: s3 on dbstore2001 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 320.48 seconds [20:21:02] 10Operations, 10monitoring, 10Patch-For-Review: Check long-running screen/tmux sessions - https://phabricator.wikimedia.org/T165348#3542897 (10Dzahn) So i started implementing this as an Icinga check. Unless you specifically meant for it _not_ to be an Icinga check and just run in cron and send email if it d... 
[20:25:36] !log restbase-dev* - puppet runs fail due to E: Version '3.11.0' for 'cassandra' was not found [20:25:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:36:49] RECOVERY - MariaDB Slave Lag: s3 on dbstore2001 is OK: OK slave_sql_lag Replication lag: 50.62 seconds [20:38:50] (03PS1) 10Niharika29: Redo "Enable CodeMirror everywhere but RTL wikis and wikidata" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373142 [20:39:47] (03PS2) 10Niharika29: Redo "Enable CodeMirror everywhere but RTL wikis and wikidata" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373142 [20:39:57] (03CR) 10jerkins-bot: [V: 04-1] Redo "Enable CodeMirror everywhere but RTL wikis and wikidata" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373142 (owner: 10Niharika29) [20:46:33] (03PS3) 10Niharika29: Redo "Enable CodeMirror everywhere but RTL wikis and wikidata" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373142 [20:47:23] (03PS4) 10Niharika29: Redo "Enable CodeMirror everywhere but RTL wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373142 [20:51:04] (03CR) 10Luke081515: [C: 04-1] Redo "Enable CodeMirror everywhere but RTL wikis" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373142 (owner: 10Niharika29) [20:51:29] @seen FlorianSW [20:51:29] greg-g: Last time I saw FlorianSW they were joining the channel, but they are not in the channel now and I don't know why, in #mediawiki-i18n at 6/20/2017 12:25:51 AM (63d20h25m38s ago) [20:51:42] thcipriani: ^ [20:52:08] 10Operations, 10Ops-Access-Requests, 10Analytics, 10Research, and 2 others: NDA, MOU and LDAP (analytics cluster) for Shilad Sen - https://phabricator.wikimedia.org/T171988#3542976 (10herron) Shell account `shiladsen` has been added to puppet and deployed to stat systems: stat1003:~$ id shiladsen uid=... 
[20:52:15] (03CR) 10Niharika29: Redo "Enable CodeMirror everywhere but RTL wikis" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373142 (owner: 10Niharika29) [20:52:30] greg-g: [22:52:15] -NickServ- Last seen : Jun 21 19:26:20 2017 (8w 6d 1h ago) [20:52:32] greg-g: thanks, I just reverted [20:52:39] 10Operations, 10Ops-Access-Requests, 10Analytics, 10Research, and 2 others: NDA, MOU and LDAP (analytics cluster) for Shilad Sen - https://phabricator.wikimedia.org/T171988#3542977 (10herron) [20:53:05] thcipriani: +1 [20:53:38] Sagan: thanks :) [20:55:08] (03CR) 10Niharika29: ">" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373142 (owner: 10Niharika29) [20:56:01] (03PS5) 10Niharika29: Redo "Enable CodeMirror everywhere but RTL wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373142 [20:56:38] (03CR) 10Niharika29: "Gah. I don't know. I'll make a new patch." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373142 (owner: 10Niharika29) [20:57:40] 10Operations, 10Ops-Access-Requests: +2 in job-related repos for new Android engineers - https://phabricator.wikimedia.org/T173874#3542983 (10Mholloway) [20:59:17] (03PS1) 10Niharika29: Deploy CodeMirror to all non-RTL wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373144 (https://phabricator.wikimedia.org/T170966) [20:59:24] (03Abandoned) 10Niharika29: Redo "Enable CodeMirror everywhere but RTL wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373142 (owner: 10Niharika29) [21:00:05] MaxSem and Niharika: Respected human, time to deploy CodeMirror deployment (for realz) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170822T2100). Please do the needful. 
[21:00:19] MaxSem: https://gerrit.wikimedia.org/r/#/c/373144/ [21:02:02] (03CR) 10MaxSem: [C: 032] Deploy CodeMirror to all non-RTL wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373144 (https://phabricator.wikimedia.org/T170966) (owner: 10Niharika29) [21:03:30] (03Merged) 10jenkins-bot: Deploy CodeMirror to all non-RTL wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373144 (https://phabricator.wikimedia.org/T170966) (owner: 10Niharika29) [21:03:48] (03CR) 10jenkins-bot: Deploy CodeMirror to all non-RTL wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373144 (https://phabricator.wikimedia.org/T170966) (owner: 10Niharika29) [21:04:03] MaxSem: Test! [21:18:24] !log niharika29@tin Synchronized wmf-config/InitialiseSettings.php: Retry CodeMirror deployment T170966 (duration: 00m 49s) [21:18:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:18:36] T170966: Epic: Tracking task for CodeMirror deployment - https://phabricator.wikimedia.org/T170966 [21:47:33] 10Operations, 10Ops-Access-Requests, 10Analytics, 10Research, and 2 others: NDA, MOU and LDAP (analytics cluster) for Shilad Sen - https://phabricator.wikimedia.org/T171988#3543170 (10DarTar) Thanks @herron. @Shilad: Aaron is currently traveling, but let me know if you need any assistance. If we have an... 
[21:56:10] PROBLEM - MariaDB Slave Lag: s5 on dbstore2001 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 24268.52 seconds [21:57:34] (03PS1) 10RobH: two new ldap users sharvaniharan and Cooltey [puppet] - 10https://gerrit.wikimedia.org/r/373148 [21:57:59] (03CR) 10jerkins-bot: [V: 04-1] two new ldap users sharvaniharan and Cooltey [puppet] - 10https://gerrit.wikimedia.org/r/373148 (owner: 10RobH) [21:59:28] stupid spacing in commit msg =P [21:59:32] (03PS2) 10RobH: two new ldap users sharvaniharan and Cooltey [puppet] - 10https://gerrit.wikimedia.org/r/373148 (https://phabricator.wikimedia.org/T173874) [21:59:35] i did it the wrong way for too many years [22:01:06] (03PS1) 10Rush: openstack: glance as module/profile/role for deployments [puppet] - 10https://gerrit.wikimedia.org/r/373150 (https://phabricator.wikimedia.org/T171494) [22:01:34] (03CR) 10jerkins-bot: [V: 04-1] openstack: glance as module/profile/role for deployments [puppet] - 10https://gerrit.wikimedia.org/r/373150 (https://phabricator.wikimedia.org/T171494) (owner: 10Rush) [22:04:10] (03PS2) 10Rush: openstack: glance as module/profile/role for deployments [puppet] - 10https://gerrit.wikimedia.org/r/373150 (https://phabricator.wikimedia.org/T171494) [22:11:47] 10Operations, 10Gerrit, 10LDAP-Access-Requests: Add new users Sharvaniharan and Cooltey to releasers-mobile - https://phabricator.wikimedia.org/T173886#3543236 (10Mholloway) [22:44:50] (03PS3) 10Rush: openstack: glance as module/profile/role for deployments [puppet] - 10https://gerrit.wikimedia.org/r/373150 (https://phabricator.wikimedia.org/T171494) [23:00:05] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170822T2300). Please do the needful. 
[23:00:05] bmansurov and Niharika: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [23:00:12] here [23:00:22] o/ [23:01:13] (03CR) 10Rush: [C: 032] openstack: glance as module/profile/role for deployments [puppet] - 10https://gerrit.wikimedia.org/r/373150 (https://phabricator.wikimedia.org/T171494) (owner: 10Rush) [23:02:06] I can SWAT [23:06:43] (03PS1) 10Rush: openstack: glance set db params as variables [puppet] - 10https://gerrit.wikimedia.org/r/373154 (https://phabricator.wikimedia.org/T171494) [23:07:40] bmansurov: your patch is on mwdebug1002, check please [23:07:43] (03CR) 10Rush: [C: 032] openstack: glance set db params as variables [puppet] - 10https://gerrit.wikimedia.org/r/373154 (https://phabricator.wikimedia.org/T171494) (owner: 10Rush) [23:07:52] thcipriani, checking [23:08:24] thcipriani, working! [23:08:30] bmansurov: going live [23:08:30] I see the missing icons now [23:08:33] cool [23:08:40] (03CR) 10BBlack: "This isn't a situation where you need backup plans. It's checked in realtime, and we can always amend it in realtime if we decide to chan" [dns] - 10https://gerrit.wikimedia.org/r/372900 (https://phabricator.wikimedia.org/T173787) (owner: 10RobH) [23:10:33] !log thcipriani@tin Synchronized php-1.30.0-wmf.14/extensions/Popups/includes/PopupsHooks.php: SWAT: [[gerrit:372593|Remove aborting of BeforePageDisplay hook]] T173411 (duration: 00m 49s) [23:10:38] ^ bmansurov live now [23:10:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:10:47] T173411: Notifications icons are broken on some special pages (Special:GadgetUsage, Special:Upload, Special:Notifications) (module ext.echo.styles.badge not loaded) - https://phabricator.wikimedia.org/T173411 [23:11:05] thcipriani, thanks! [23:11:48] Niharika: looks like yours is l10n only. I'll run a full scap for it now and let you know when it completes.
[23:13:12] !log thcipriani@tin Started scap: SWAT: [[gerrit:372902|Fix typo in the notification message]] [23:13:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:14:45] thcipriani: I have a patch for a fatal that needs to go out (https://gerrit.wikimedia.org/r/#/c/373157/), could I or you sync it out after the scap? [23:15:54] legoktm: sure, I can get it out after this scap. Should be fairly quick since I just did a full sync a few hours ago. [23:18:58] thanks :) [23:21:26] 10Operations, 10ops-eqiad, 10Cloud-Services: rack/setup/install labnet100[34] - https://phabricator.wikimedia.org/T165779#3543357 (10RobH) [23:23:47] 10Operations, 10Gerrit, 10LDAP-Access-Requests: Add new users Sharvaniharan and Cooltey to releasers-mobile - https://phabricator.wikimedia.org/T173886#3543361 (10RobH) It is typically best to handle each user request as their own task, since there will be steps that each user has to handle on their own. Ho... [23:28:11] 10Operations, 10Gerrit, 10LDAP-Access-Requests: Add new users Sharvaniharan and Cooltey to releasers-mobile - https://phabricator.wikimedia.org/T173886#3543368 (10RobH) p:05Triage>03Normal [23:35:02] !log thcipriani@tin Finished scap: SWAT: [[gerrit:372902|Fix typo in the notification message]] (duration: 21m 48s) [23:35:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:35:54] apergos: ping? [23:36:10] Niharika: your l10n change should be live [23:36:19] legoktm: your change is on mwdebug1002, anything to check? [23:36:25] thcipriani: Thank you so much! [23:36:30] yw :) [23:37:03] * legoktm tests [23:37:18] thcipriani: confirmed that https://www.mediawiki.org/wiki/Special:LintErrors/pwrap-bug-workaround doesn't fatal on mwdebug1002, so LGTM [23:37:28] okie doke, going live [23:38:28] By the way, that scap run was pretty quick, 22 minutes.
[23:39:30] !log thcipriani@tin Synchronized php-1.30.0-wmf.15/extensions/Linter/includes/LintErrorsPager.php: SWAT: [[gerrit:373157|Fix up 11f4a97ba6bcd0c1de]] (duration: 00m 49s) [23:39:40] ^ legoktm live now [23:39:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:40:06] thanks :) [23:40:23] Niharika: yeah, scap is definitely quicker when it's been run recently :) [23:40:38] Interesting. [23:46:23] (03PS3) 10BBlack: Setting namecheap/comodo CAA records [dns] - 10https://gerrit.wikimedia.org/r/372900 (https://phabricator.wikimedia.org/T173787) (owner: 10RobH) [23:46:25] (03PS1) 10BBlack: wikimedia.org CAA: split issue-vs-issuewild, document clearer [dns] - 10https://gerrit.wikimedia.org/r/373163 [23:46:44] (03CR) 10jerkins-bot: [V: 04-1] Setting namecheap/comodo CAA records [dns] - 10https://gerrit.wikimedia.org/r/372900 (https://phabricator.wikimedia.org/T173787) (owner: 10RobH) [23:46:48] (03CR) 10jerkins-bot: [V: 04-1] wikimedia.org CAA: split issue-vs-issuewild, document clearer [dns] - 10https://gerrit.wikimedia.org/r/373163 (owner: 10BBlack) [23:48:47] 10Operations, 10Cloud-Services: rack/setup/install labnet100[34] - https://phabricator.wikimedia.org/T165779#3543422 (10RobH) a:05RobH>03chasemp [23:49:06] 10Operations, 10Cloud-Services: rack/setup/install labnet100[34] - https://phabricator.wikimedia.org/T165779#3276633 (10RobH) These are both all setup and ready for cloud team to take over. Assigned to @chasemp for followup. [23:50:03] 10Operations, 10Cloud-Services: rack/setup/install labnet100[34] - https://phabricator.wikimedia.org/T165779#3543426 (10chasemp) Thanks @robh