[00:00:05] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: How many deployers does it take to do Evening SWAT (Max 8 patches) deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171114T0000).
[00:00:05] No GERRIT patches in the queue for this window AFAICS.
[00:05:55] RECOVERY - puppet last run on stat1005 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures
[00:11:41] Is it just me or the number of SWAT patches has been pretty low lately?
[00:16:16] I think it has been
[00:16:27] Maybe it's because it's not the end of a quarter any more? :P
[00:16:44] Or maybe some of the volunteers contributing lots of config changes haven't been as active
[00:34:12] (CR) Jforrester: [C: -1] "Needs T180335 done first." [mediawiki-config] - https://gerrit.wikimedia.org/r/390386 (https://phabricator.wikimedia.org/T180147) (owner: Addshore)
[00:51:15] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: /{domain}/v1/media/image/featured/{yyyy}/{mm}/{dd} (retrieve featured image data for April 29, 2016) is CRITICAL: Test retrieve featured image data for April 29, 2016 returned the unexpected status 500 (expecting: 200)
[00:53:16] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy
[01:30:49] !log mobileapps rolling back today's deployment due to a significant increase in errors.
[01:30:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:31:50] !log mholloway-shell@tin Started deploy [mobileapps/deploy@00e60b2]: (no justification provided)
[01:31:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:34:18] !log mholloway-shell@tin Finished deploy [mobileapps/deploy@00e60b2]: (no justification provided) (duration: 02m 27s)
[01:34:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:35:00] !log mobileapps finished rolling back to 11/8 deployment
[01:35:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:52:15] PROBLEM - Host cp3048 is DOWN: PING CRITICAL - Packet loss = 100%
[01:54:05] RECOVERY - Host cp3048 is UP: PING OK - Packet loss = 0%, RTA = 83.78 ms
[02:01:04] (PS1) Ayounsi: [WIP] Have every rdns advertise a private anycast VIP [puppet] - https://gerrit.wikimedia.org/r/391149
[02:24:21] !log l10nupdate@tin scap sync-l10n completed (1.31.0-wmf.7) (duration: 07m 28s)
[02:24:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:30:59] !log l10nupdate@tin ResourceLoader cache refresh completed at Tue Nov 14 02:30:59 UTC 2017 (duration 6m 38s)
[02:31:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:01:45] PROBLEM - Host cp3048 is DOWN: PING CRITICAL - Packet loss = 100%
[03:03:45] RECOVERY - Host cp3048 is UP: PING WARNING - Packet loss = 61%, RTA = 83.79 ms
[03:26:01] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 632.04 seconds
[04:08:11] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 177.28 seconds
[05:31:07] Operations, ops-esams, Traffic: cp3048 crashed - https://phabricator.wikimedia.org/T180424#3757577 (BBlack)
[06:18:49] !log Deploy alter table on s6 primary master (db1061) - T174569
[06:18:56] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:18:56] T174569: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569
[06:24:12] (PS1) Marostegui: db-eqiad.php: Increase weight for db1103 [mediawiki-config] - https://gerrit.wikimedia.org/r/391159 (https://phabricator.wikimedia.org/T178359)
[06:26:28] (CR) Marostegui: [C: 2] db-eqiad.php: Increase weight for db1103 [mediawiki-config] - https://gerrit.wikimedia.org/r/391159 (https://phabricator.wikimedia.org/T178359) (owner: Marostegui)
[06:27:36] (Merged) jenkins-bot: db-eqiad.php: Increase weight for db1103 [mediawiki-config] - https://gerrit.wikimedia.org/r/391159 (https://phabricator.wikimedia.org/T178359) (owner: Marostegui)
[06:27:46] (CR) jenkins-bot: db-eqiad.php: Increase weight for db1103 [mediawiki-config] - https://gerrit.wikimedia.org/r/391159 (https://phabricator.wikimedia.org/T178359) (owner: Marostegui)
[06:28:44] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Increase weight for db1103 on s2 and s4 - T178359 (duration: 00m 47s)
[06:28:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:28:51] T178359: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359
[06:30:40] PROBLEM - puppet last run on db2018 is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues
[06:32:30] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 120, down: 1, dormant: 0, excluded: 0, unused: 0
[06:39:03] (PS1) Marostegui: db1105.yaml: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/391160 (https://phabricator.wikimedia.org/T178359)
[06:39:59] (CR) Marostegui: [C: 2] db1105.yaml: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/391160 (https://phabricator.wikimedia.org/T178359) (owner: Marostegui)
[06:50:41] RECOVERY - puppet last run on db2018 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures
[06:57:58] (PS1) Marostegui: db-eqiad,db-codfw.php: Pool db1105 as rc for s1,s2 [mediawiki-config] - https://gerrit.wikimedia.org/r/391161 (https://phabricator.wikimedia.org/T178359)
[07:07:19] (PS1) Krinkle: [WIP] Split profile.php from StartProfiler and add PhpAutoPrepend [mediawiki-config] - https://gerrit.wikimedia.org/r/391162 (https://phabricator.wikimedia.org/T180183)
[07:09:57] (PS2) Krinkle: [WIP] Split profile.php from StartProfiler and add PhpAutoPrepend [mediawiki-config] - https://gerrit.wikimedia.org/r/391162 (https://phabricator.wikimedia.org/T180183)
[07:12:32] PROBLEM - HHVM rendering on mw2151 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:13:31] RECOVERY - HHVM rendering on mw2151 is OK: HTTP OK: HTTP/1.1 200 OK - 74501 bytes in 0.293 second response time
[07:21:31] (PS1) Urbanecm: Disable EducationProgram on cs.wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/391163 (https://phabricator.wikimedia.org/T180426)
[07:29:01] PROBLEM - Router interfaces on cr1-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 66, down: 1, dormant: 0, excluded: 0, unused: 0
[07:29:22] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 37, down: 1, dormant: 0, excluded: 0,
unused: 0
[07:52:24] Operations, Performance-Team, monitoring, Graphite: Upgrade to latest Grafana 4.6 - https://phabricator.wikimedia.org/T180428#3757681 (Peter)
[07:58:45] Operations, monitoring, Graphite, Performance-Team (Radar): Upgrade to latest Grafana 4.6 - https://phabricator.wikimedia.org/T180428#3757696 (Peter)
[08:06:38] Operations, Patch-For-Review, User-Joe: etcd cluster in codfw has raft consensus issues - https://phabricator.wikimedia.org/T162013#3757706 (Joe) Since we had no more alarms in real-world situations, I think we can safely close this ticket now .
[08:06:45] Operations, Patch-For-Review, User-Joe: etcd cluster in codfw has raft consensus issues - https://phabricator.wikimedia.org/T162013#3757707 (Joe) Open>Resolved
[08:13:41] !log rebooting job runners in codfw for update to Linux 4.9.51 (and to pick up OpenSSL updates)
[08:13:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:18:58] (PS2) Addshore: Add AdvancedSearch to extension-list [mediawiki-config] - https://gerrit.wikimedia.org/r/390385 (https://phabricator.wikimedia.org/T180147)
[08:19:03] (CR) Addshore: [C: 2] Add AdvancedSearch to extension-list [mediawiki-config] - https://gerrit.wikimedia.org/r/390385 (https://phabricator.wikimedia.org/T180147) (owner: Addshore)
[08:20:15] (Merged) jenkins-bot: Add AdvancedSearch to extension-list [mediawiki-config] - https://gerrit.wikimedia.org/r/390385 (https://phabricator.wikimedia.org/T180147) (owner: Addshore)
[08:20:25] (CR) jenkins-bot: Add AdvancedSearch to extension-list [mediawiki-config] - https://gerrit.wikimedia.org/r/390385 (https://phabricator.wikimedia.org/T180147) (owner: Addshore)
[08:21:54] !log addshore@tin Synchronized wmf-config/extension-list: [[gerrit:390385|Add AdvancedSearch to extension-list]] T180147 PT 1/2 (duration: 00m 47s)
[08:22:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:22:01] T180147: Deploy AdvancedSearch extension to group0 - https://phabricator.wikimedia.org/T180147
[08:22:58] !log addshore@tin Synchronized wmf-config/extension-list-labs: [[gerrit:390385|Add AdvancedSearch to extension-list]] T180147 PT 2/2 LABS ONLY (duration: 00m 46s)
[08:23:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:24:11] PROBLEM - puppet last run on snapshot1007 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/dumpcirrussearch.sh]
[08:31:57] !log Deploy alter table on s5 - dbstore1002 - T174569
[08:32:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:32:04] T174569: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569
[08:34:12] Operations, Ops-Access-Requests, Patch-For-Review, Reading-Infrastructure-Team-Backlog (Kanban): All Reading Infrastructure engineers should have deploy rights for all services Readers engineering maintains - https://phabricator.wikimedia.org/T180366#3757723 (mobrovac) Thnx @Mholloway, +1 from me...
[08:37:03] Operations, ops-eqiad, Analytics, User-Elukey: Possibly faulty BBU on analytics1029 - https://phabricator.wikimedia.org/T178742#3757724 (elukey) Open>Resolved Everything seems good, removed downtime for the host. Thanks Chris!
[08:42:21] PROBLEM - HHVM rendering on mw2225 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:43:11] RECOVERY - HHVM rendering on mw2225 is OK: HTTP OK: HTTP/1.1 200 OK - 74535 bytes in 0.329 second response time
[08:44:09] Operations, ops-esams, Traffic: cp3048 crashed - https://phabricator.wikimedia.org/T180424#3757731 (Peachey88)
[08:45:24] (PS1) Mobrovac: JobQueue: Switch RecordLintJob to EventBus [mediawiki-config] - https://gerrit.wikimedia.org/r/391166 (https://phabricator.wikimedia.org/T175212)
[08:46:17] (PS4) Elukey: role::prometheus::analytics: add druid jmx exporter settings [puppet] - https://gerrit.wikimedia.org/r/391007 (https://phabricator.wikimedia.org/T177459)
[08:47:31] PROBLEM - Nginx local proxy to apache on mw2145 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:48:21] RECOVERY - Nginx local proxy to apache on mw2145 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.197 second response time
[08:49:11] RECOVERY - puppet last run on snapshot1007 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures
[08:51:11] PROBLEM - MariaDB Slave Lag: s5 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 887.14 seconds
[08:51:36] ^ the alter table
[08:51:38] I am going to silence it
[09:00:04] Pchelolo, mobrovac, and _joe_: That opportune time is upon us again. Time for a JobQueue Migration deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171114T0900).
[09:00:04] No GERRIT patches in the queue for this window AFAICS.
[09:00:12] RECOVERY - MariaDB Slave Lag: s5 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 94.42 seconds
[09:00:53] Operations, ops-codfw: Degraded RAID on wtp2017 - https://phabricator.wikimedia.org/T180373#3757767 (MoritzMuehlenhoff) Open>Invalid Duplicate of T180373
[09:05:12] !log jmm@puppetmaster1001 conftool action : set/pooled=yes; selector: wtp2018.codfw.wmnet
[09:05:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:05:24] !log jmm@puppetmaster1001 conftool action : set/pooled=yes; selector: mw2108.codfw.wmnet
[09:05:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:06:08] Operations, ops-codfw: Broken memory on mw2108 - https://phabricator.wikimedia.org/T180200#3757770 (MoritzMuehlenhoff) Thanks, I ran "scap pull" and repooled the host.
[09:07:29] !log rebooting remaining app servers in codfw for update to Linux 4.9.51 (and to pick up OpenSSL updates)
[09:07:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:10:22] RECOVERY - mediawiki-installation DSH group on mw2108 is OK: OK
[09:15:54] (PS2) WMDE-leszek: Load DataTypes extension [mediawiki-config] - https://gerrit.wikimedia.org/r/391029 (https://phabricator.wikimedia.org/T180062)
[09:17:57] (PS3) WMDE-leszek: Load DataTypes extension [mediawiki-config] - https://gerrit.wikimedia.org/r/391029 (https://phabricator.wikimedia.org/T180062)
[09:22:14] (PS4) WMDE-leszek: Load DataTypes extension [mediawiki-config] - https://gerrit.wikimedia.org/r/391029 (https://phabricator.wikimedia.org/T180062)
[09:22:39] (CR) Filippo Giunchedi: [C: 1] role::prometheus::analytics: add druid jmx exporter settings [puppet] - https://gerrit.wikimedia.org/r/391007 (https://phabricator.wikimedia.org/T177459) (owner: Elukey)
[09:22:41] (CR) Elukey: [C: 2] role::prometheus::analytics: add druid jmx exporter settings [puppet] - https://gerrit.wikimedia.org/r/391007
(https://phabricator.wikimedia.org/T177459) (owner: Elukey)
[09:26:23] Operations, Deployments, Beta-Cluster-reproducible, HHVM, and 2 others: Switch mwscript from Zend PHP5 to default php alternative (e.g. HHVM or PHP7) - https://phabricator.wikimedia.org/T146285#3757809 (tstarling)
[09:27:16] !log ppchelko@tin Started deploy [cpjobqueue/deploy@df5fca9]: Enable RecordLintJob processing
[09:27:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:27:56] !log ppchelko@tin Finished deploy [cpjobqueue/deploy@df5fca9]: Enable RecordLintJob processing (duration: 00m 40s)
[09:28:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:32:00] (PS1) Filippo Giunchedi: cassandra: reprovision restbase2002 with cassandra 3 [puppet] - https://gerrit.wikimedia.org/r/391169 (https://phabricator.wikimedia.org/T179422)
[09:32:57] Operations, Traffic, Performance-Team (Radar): Upgrade cache_upload to Varnish 5 - https://phabricator.wikimedia.org/T180433#3757836 (ema)
[09:39:45] (CR) Jcrespo: [C: 1] db-eqiad,db-codfw.php: Pool db1105 as rc for s1,s2 [mediawiki-config] - https://gerrit.wikimedia.org/r/391161 (https://phabricator.wikimedia.org/T178359) (owner: Marostegui)
[09:39:52] thanks
[09:40:19] (PS2) Marostegui: db-eqiad,db-codfw.php: Pool db1105 as rc for s1,s2 [mediawiki-config] - https://gerrit.wikimedia.org/r/391161 (https://phabricator.wikimedia.org/T178359)
[09:41:14] Operations, Traffic: Uncacheable content handling: hfp vs hfm - https://phabricator.wikimedia.org/T180434#3757861 (ema)
[09:41:16] Operations, Traffic: Uncacheable content handling: hfp vs hfm - https://phabricator.wikimedia.org/T180434#3757876 (ema) p:Triage>Normal
[09:42:44] (CR) Marostegui: [C: 2] db-eqiad,db-codfw.php: Pool db1105 as rc for s1,s2 [mediawiki-config] - https://gerrit.wikimedia.org/r/391161 (https://phabricator.wikimedia.org/T178359) (owner: Marostegui)
[09:43:35] !log
upgrade prometheus to 1.8.1 with k8s on prometheus2004 - T177395
[09:43:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:43:42] T177395: Improve monitoring of the Kubernetes clusters - https://phabricator.wikimedia.org/T177395
[09:43:53] (Merged) jenkins-bot: db-eqiad,db-codfw.php: Pool db1105 as rc for s1,s2 [mediawiki-config] - https://gerrit.wikimedia.org/r/391161 (https://phabricator.wikimedia.org/T178359) (owner: Marostegui)
[09:45:12] (CR) Mobrovac: [C: 1] cassandra: reprovision restbase2002 with cassandra 3 [puppet] - https://gerrit.wikimedia.org/r/391169 (https://phabricator.wikimedia.org/T179422) (owner: Filippo Giunchedi)
[09:45:24] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Pool db1105 as multi-instance host for s1 and s2 - T178359 (duration: 00m 47s)
[09:45:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:45:31] T178359: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359
[09:46:13] (CR) jenkins-bot: db-eqiad,db-codfw.php: Pool db1105 as rc for s1,s2 [mediawiki-config] - https://gerrit.wikimedia.org/r/391161 (https://phabricator.wikimedia.org/T178359) (owner: Marostegui)
[09:46:19] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Pool db1105 as multi-instance host for s1 and s2 - T178359 (duration: 00m 46s)
[09:46:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:47:51] (PS1) Ema: vcl: distinguish between hfp and hfm [puppet] - https://gerrit.wikimedia.org/r/391171 (https://phabricator.wikimedia.org/T180434)
[09:49:07] !log Disabled 2FA for Jean-Frédéric
[09:49:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:50:21] PROBLEM - Check correctness of the icinga configuration on einsteinium is CRITICAL: Icinga configuration contains errors
[09:51:17] <_joe_> uhm
[09:51:22] <_joe_> can someone check icinga please?
[09:51:34] <_joe_> I'm following the jobqueue migration
[09:52:52] marostegui: i see you syncing mw-config stuff, we need the floor for 10 mins or so
[09:53:12] mobrovac: yeah, not planning to deploy anything soon :)
[09:53:18] kk thnx
[09:55:08] (CR) Mobrovac: [C: 2] JobQueue: Switch RecordLintJob to EventBus [mediawiki-config] - https://gerrit.wikimedia.org/r/391166 (https://phabricator.wikimedia.org/T175212) (owner: Mobrovac)
[09:56:21] The icinga error is "Could not find any hostgroup matching druid_public_eqiad"
[09:56:27] !log Deploy alter table on s5 - dbstore1001 - T174569
[09:56:28] ^ elukey ?
[09:56:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:56:33] T174569: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569
[09:56:36] yeah, seems related to the last patch merged
[09:57:07] sigh
[09:57:10] sorry checking
[09:57:11] (Merged) jenkins-bot: JobQueue: Switch RecordLintJob to EventBus [mediawiki-config] - https://gerrit.wikimedia.org/r/391166 (https://phabricator.wikimedia.org/T175212) (owner: Mobrovac)
[09:57:20] (CR) jenkins-bot: JobQueue: Switch RecordLintJob to EventBus [mediawiki-config] - https://gerrit.wikimedia.org/r/391166 (https://phabricator.wikimedia.org/T175212) (owner: Mobrovac)
[09:58:50] elukey: public_druid_eqiad needs an entry in common/monitoring.yaml
[09:59:17] don't you love that $cluster variable ;-) ?
[09:59:29] !log mobrovac@tin Synchronized wmf-config/jobqueue.php: JobQueue: Migrate RecordLintJob to EventBus - T175212 (duration: 00m 46s)
[09:59:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:59:35] T175212: Services Q2 2017/18 goal: Migrate a subset of jobs to multi-DC enabled event processing infrastructure.
- https://phabricator.wikimedia.org/T175212
[10:00:03] (CR) Addshore: [C: -1] "Also now planned for the 29th Nov" [mediawiki-config] - https://gerrit.wikimedia.org/r/390387 (https://phabricator.wikimedia.org/T180128) (owner: Addshore)
[10:00:15] akosiaris: thanks! fixing it a sec
[10:01:53] (CR) Addshore: [C: -1] "Now scheduled for the 22nd Nov" [mediawiki-config] - https://gerrit.wikimedia.org/r/390386 (https://phabricator.wikimedia.org/T180147) (owner: Addshore)
[10:02:39] (PS1) Elukey: monitoring.yaml: add druid clusters [puppet] - https://gerrit.wikimedia.org/r/391173 (https://phabricator.wikimedia.org/T177459)
[10:03:08] (CR) jerkins-bot: [V: -1] monitoring.yaml: add druid clusters [puppet] - https://gerrit.wikimedia.org/r/391173 (https://phabricator.wikimedia.org/T177459) (owner: Elukey)
[10:04:24] I know jenkins you are right
[10:04:27] thanks for spotting it
[10:04:28] (PS2) Elukey: monitoring.yaml: add druid clusters [puppet] - https://gerrit.wikimedia.org/r/391173 (https://phabricator.wikimedia.org/T177459)
[10:05:07] akosiaris: like this right?
--^
[10:05:45] (CR) Alexandros Kosiaris: [C: 1] monitoring.yaml: add druid clusters [puppet] - https://gerrit.wikimedia.org/r/391173 (https://phabricator.wikimedia.org/T177459) (owner: Elukey)
[10:07:16] (CR) Elukey: [C: 2] monitoring.yaml: add druid clusters [puppet] - https://gerrit.wikimedia.org/r/391173 (https://phabricator.wikimedia.org/T177459) (owner: Elukey)
[10:07:27] !log upload scap 3.7.2-1 - T127762
[10:07:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:07:34] T127762: Update Debian Package for Scap3 - https://phabricator.wikimedia.org/T127762
[10:08:35] (PS2) Filippo Giunchedi: Update scap to 3.7.2-1 [puppet] - https://gerrit.wikimedia.org/r/390355 (owner: 20after4)
[10:10:13] (CR) Filippo Giunchedi: [C: 2] Update scap to 3.7.2-1 [puppet] - https://gerrit.wikimedia.org/r/390355 (owner: 20after4)
[10:10:19] RECOVERY - Check correctness of the icinga configuration on einsteinium is OK: Icinga configuration is correct
[10:10:28] \o/
[10:23:04] !log rebooting image scalers in eqiad for update to Linux 4.9.51 (and to pick up OpenSSL updates)
[10:23:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:23:53] (PS1) Elukey: role::prometheus::analytics: remove redundant target configs [puppet] - https://gerrit.wikimedia.org/r/391179 (https://phabricator.wikimedia.org/T177459)
[10:26:34] (CR) Filippo Giunchedi: [C: 1] role::prometheus::analytics: remove redundant target configs [puppet] - https://gerrit.wikimedia.org/r/391179 (https://phabricator.wikimedia.org/T177459) (owner: Elukey)
[10:26:41] (CR) Elukey: [C: 2] role::prometheus::analytics: remove redundant target configs [puppet] - https://gerrit.wikimedia.org/r/391179 (https://phabricator.wikimedia.org/T177459) (owner: Elukey)
[10:32:00] !log removed old target configs from /srv/prometheus/analytics/targets on prometheus100[34] after https://gerrit.wikimedia.org/r/391179
[10:32:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:33:19] mobrovac: can I deploy some mediawiki-config?
[10:33:32] yup yup marostegui, we are done
[10:33:40] great - thanks!
[10:34:59] (PS1) Marostegui: db-eqiad.php: Give db1105 more traffic [mediawiki-config] - https://gerrit.wikimedia.org/r/391181 (https://phabricator.wikimedia.org/T178359)
[10:37:47] (CR) Marostegui: [C: 2] db-eqiad.php: Give db1105 more traffic [mediawiki-config] - https://gerrit.wikimedia.org/r/391181 (https://phabricator.wikimedia.org/T178359) (owner: Marostegui)
[10:38:58] (Merged) jenkins-bot: db-eqiad.php: Give db1105 more traffic [mediawiki-config] - https://gerrit.wikimedia.org/r/391181 (https://phabricator.wikimedia.org/T178359) (owner: Marostegui)
[10:39:08] (CR) jenkins-bot: db-eqiad.php: Give db1105 more traffic [mediawiki-config] - https://gerrit.wikimedia.org/r/391181 (https://phabricator.wikimedia.org/T178359) (owner: Marostegui)
[10:41:17] (PS1) Jcrespo: Refactor evenlogging database hosts definition [puppet] - https://gerrit.wikimedia.org/r/391182 (https://phabricator.wikimedia.org/T177405)
[10:41:50] (CR) jerkins-bot: [V: -1] Refactor evenlogging database hosts definition [puppet] - https://gerrit.wikimedia.org/r/391182 (https://phabricator.wikimedia.org/T177405) (owner: Jcrespo)
[10:43:58] Operations, monitoring, Graphite, Performance-Team (Radar): Upgrade to latest Grafana 4.6 - https://phabricator.wikimedia.org/T180428#3758108 (Volans) I personally don't think that using the native annotations available in Grafana 4.6 from an automated system (WebPageTest in this case) is a good...
[10:46:08] (PS1) Marostegui: db-eqiad.php: Depool db1106 and db1100 [mediawiki-config] - https://gerrit.wikimedia.org/r/391184 (https://phabricator.wikimedia.org/T174569)
[10:48:07] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1106 and db1100 [mediawiki-config] - https://gerrit.wikimedia.org/r/391184 (https://phabricator.wikimedia.org/T174569) (owner: Marostegui)
[10:49:14] Operations, Trending-Service, Reading-Infrastructure-Team-Backlog (Kanban), Services (designing): Turn off Trending Service - https://phabricator.wikimedia.org/T180384#3755911 (Joe) This whole event brings forward a larger question about microservices and their cost. This service did cost variou...
[10:49:23] (Merged) jenkins-bot: db-eqiad.php: Depool db1106 and db1100 [mediawiki-config] - https://gerrit.wikimedia.org/r/391184 (https://phabricator.wikimedia.org/T174569) (owner: Marostegui)
[10:49:33] (CR) jenkins-bot: db-eqiad.php: Depool db1106 and db1100 [mediawiki-config] - https://gerrit.wikimedia.org/r/391184 (https://phabricator.wikimedia.org/T174569) (owner: Marostegui)
[10:50:42] !log Deploy alter table on s5: db1104 db1100 db1106 - T174569
[10:50:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:50:48] T174569: Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569
[10:51:18] PROBLEM - puppet last run on sca2004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[10:54:31] Operations, Traffic: Change "CP" cookie from subdomain to project level - https://phabricator.wikimedia.org/T180407#3758143 (Nemo_bis) I assume the same would apply to the "UseDC" cookie?
[10:56:10] (PS2) Filippo Giunchedi: cassandra: reprovision restbase2002 with cassandra 3 [puppet] - https://gerrit.wikimedia.org/r/391169 (https://phabricator.wikimedia.org/T179422)
[10:59:09] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0
[10:59:18] RECOVERY - Router interfaces on cr1-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 68, down: 0, dormant: 0, excluded: 0, unused: 0
[11:00:39] (CR) Filippo Giunchedi: [C: 2] cassandra: reprovision restbase2002 with cassandra 3 [puppet] - https://gerrit.wikimedia.org/r/391169 (https://phabricator.wikimedia.org/T179422) (owner: Filippo Giunchedi)
[11:02:53] (PS2) Jcrespo: Refactor evenlogging database hosts definition [puppet] - https://gerrit.wikimedia.org/r/391182 (https://phabricator.wikimedia.org/T177405)
[11:03:22] (CR) jerkins-bot: [V: -1] Refactor evenlogging database hosts definition [puppet] - https://gerrit.wikimedia.org/r/391182 (https://phabricator.wikimedia.org/T177405) (owner: Jcrespo)
[11:04:50] !log reimage restbase2002 - T179422
[11:04:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:04:57] T179422: Reshape RESTBase Cassandra clusters - https://phabricator.wikimedia.org/T179422
[11:05:16] (PS3) Jcrespo: Refactor evenlogging database hosts definition [puppet] - https://gerrit.wikimedia.org/r/391182 (https://phabricator.wikimedia.org/T177405)
[11:09:40] (CR) Jcrespo: "Aaron: I asked Giuseppe to review and deploy this 3 weeks ago- so I assume he is the one blocking it for some reason."
[puppet] - https://gerrit.wikimedia.org/r/384695 (https://phabricator.wikimedia.org/T175672) (owner: Jcrespo)
[11:12:53] (PS4) Jcrespo: Refactor evenlogging database hosts definition [puppet] - https://gerrit.wikimedia.org/r/391182 (https://phabricator.wikimedia.org/T177405)
[11:13:07] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 122, down: 0, dormant: 0, excluded: 0, unused: 0
[11:13:21] Operations, Traffic: Puppet / LVS: confusion in service vs IP name - https://phabricator.wikimedia.org/T180257#3758164 (ema) p:Triage>Normal
[11:21:26] RECOVERY - puppet last run on sca2004 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[11:22:39] !log rebooting remaining API servers in eqiad for update to Linux 4.9.51 (and to pick up OpenSSL updates)
[11:22:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:30:07] (PS5) Jcrespo: Refactor evenlogging database hosts definition [puppet] - https://gerrit.wikimedia.org/r/391182 (https://phabricator.wikimedia.org/T177405)
[11:34:07] (PS8) Jcrespo: proxysql: Setup proxysql on terbium/wasat as a test [puppet] - https://gerrit.wikimedia.org/r/384695 (https://phabricator.wikimedia.org/T175672)
[11:40:31] (PS6) Jcrespo: Refactor evenlogging database hosts definition [puppet] - https://gerrit.wikimedia.org/r/391182 (https://phabricator.wikimedia.org/T177405)
[11:48:35] (PS3) Ema: varnish: log slow requests [puppet] - https://gerrit.wikimedia.org/r/390258
[11:49:48] !log rebooting remaining app servers in eqiad for update to Linux 4.9.51 (and to pick up OpenSSL updates)
[11:49:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:53:25] while checking ores too often I get this "Request from 91.3.165.56 via cp3007 cp3007, Varnish XID 4812612 Error: 503, Backend fetch failed at Tue, 14 Nov 2017 11:52:47 GMT"
[11:53:38] emsams is not happy?
[11:54:04] *esams
[11:55:38] Operations, Trending-Service, Reading-Infrastructure-Team-Backlog (Kanban), Services (designing): Turn off Trending Service - https://phabricator.wikimedia.org/T180384#3758380 (Joe) >>! In T180384#3756999, @Krenair wrote: >>>! In T180384#3756237, @Fjalapeno wrote: >> Really the concept needs more...
[11:56:55] Amir1: can you try to reproduce using another DC?
[11:57:04] eg: curl --resolve ores.wikimedia.org:443:208.80.153.248 -I https://ores.wikimedia.org
[11:57:07] (codfw)
[11:57:21] let me try
[11:58:15] It works just fine in another dc:
[11:58:18] https://www.irccloud.com/pastebin/miyWYopv/
[11:59:21] the exact same url errors out any time I try it for esams
[11:59:39] yeah thanks Amir1 I'm trying to figure out what's wrong with cp3007
[12:00:57] All of them route to cp3007, that's the only thing is consistent
[12:01:01] Thank you
[12:01:24] Operations, Trending-Service, Reading-Infrastructure-Team-Backlog (Kanban), Services (designing): Turn off Trending Service - https://phabricator.wikimedia.org/T180384#3755911 (faidon) Do we really need all this for an endpoint marked as "experimental"? Rolling out a more experimental service is...
[12:03:14] !log restart varnish-be on cp3007, requests failing with 'no backend connection'
[12:03:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:04:47] Operations, monitoring, Graphite, Performance-Team (Radar): Upgrade to latest Grafana 4.6 - https://phabricator.wikimedia.org/T180428#3758400 (fgiunchedi) If the concern is sqlite scalability I don't foresee that being a problem even for automatic annotations unless grafana does really inefficien...
[12:05:12] Amir1: I've tried the curl you've pasted above and it seems to go through cp3007 fine now, can you confirm?
[12:05:39] yes, it's happy now
[12:05:45] Thanks!
[12:07:17] np
[12:12:46] (CR) Filippo Giunchedi: [C: 1] varnish: log slow requests [puppet] - https://gerrit.wikimedia.org/r/390258 (owner: Ema)
[12:13:52] (CR) Ema: [C: 2] varnish: log slow requests [puppet] - https://gerrit.wikimedia.org/r/390258 (owner: Ema)
[12:15:21] PROBLEM - cassandra-a CQL 10.192.16.165:9042 on restbase2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[12:16:31] <_joe_> ema: nice!
[12:19:44] _joe_: thanks! There's a minor WTF re:programname
[12:20:08] for varnish-be it works: "program:varnish-slowreqs"
[12:20:30] for varnish-fe it's not set and I suspect some issues with the programname being too long (varnish-frontend-slowreqs)
[12:20:41] PROBLEM - cassandra-b CQL 10.192.16.166:9042 on restbase2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[12:21:35] at any rate you can see the logs starting to get into logstash by searching for 'slowreqs' in kibana
[12:23:43] Operations, Trending-Service, Reading-Infrastructure-Team-Backlog (Kanban), Services (designing): Turn off Trending Service - https://phabricator.wikimedia.org/T180384#3755911 (Pchelolo) > I would say having a marker (the "experimental" one, or a similar one) and setting expectations to be "it ma...
[12:25:10] heh
[12:25:16] http://www.rsyslog.com/sende-messages-with-tags-larger-than-32-characters/
[12:25:51] PROBLEM - cassandra-c CQL 10.192.16.167:9042 on restbase2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[12:26:12] PROBLEM - cassandra-b SSL 10.192.16.166:7001 on restbase2002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[12:26:25] Operations, Trending-Service, Reading-Infrastructure-Team-Backlog (Kanban), Services (designing): Turn off Trending Service - https://phabricator.wikimedia.org/T180384#3758440 (Joe) >! In T180384#3758434, @Pchelolo wrote: > I went to hive to check the external traffic to the endpoint from web req...
[12:26:41] PROBLEM - cassandra-a SSL 10.192.16.165:7001 on restbase2002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [12:26:51] PROBLEM - cassandra-a service on restbase2002 is CRITICAL: NRPE: Command check_cassandra-a-state not defined [12:27:01] PROBLEM - cassandra-b service on restbase2002 is CRITICAL: NRPE: Command check_cassandra-b-state not defined [12:27:41] PROBLEM - cassandra-c SSL 10.192.16.167:7001 on restbase2002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [12:28:33] expected, i guess godog ^ ? [12:29:53] mobrovac: seems so, he logged a reimage earlier on [12:30:59] 10Operations, 10Trending-Service, 10Reading-Infrastructure-Team-Backlog (Kanban), 10Services (designing): Turn off Trending Service - https://phabricator.wikimedia.org/T180384#3758450 (10Joe) An undeployment procedure would be: [] remove the endpoint from restbase, This needs a restbase deployment. [] Rem... [12:36:30] (03PS1) 10Ladsgroup: Comply wikidata with new ores thresholds [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391197 (https://phabricator.wikimedia.org/T180450) [12:36:58] (03PS1) 10Jcrespo: WIP: setup s8 and db-common.php refactoring [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391198 (https://phabricator.wikimedia.org/T177208) [12:38:07] (03CR) 10Jcrespo: "This is in no way ready to be deployed, but it is open for comments." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/391198 (https://phabricator.wikimedia.org/T177208) (owner: 10Jcrespo) [12:38:36] (03CR) 10jerkins-bot: [V: 04-1] WIP: setup s8 and db-common.php refactoring [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391198 (https://phabricator.wikimedia.org/T177208) (owner: 10Jcrespo) [12:38:47] indeed mobrovac, expired downtime [12:39:05] hehe thought as much, thnx godog [12:39:47] (03PS2) 10Jcrespo: WIP: setup s8 and db-common.php refactoring [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391198 (https://phabricator.wikimedia.org/T177208) [12:41:38] (03CR) 10jerkins-bot: [V: 04-1] WIP: setup s8 and db-common.php refactoring [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391198 (https://phabricator.wikimedia.org/T177208) (owner: 10Jcrespo) [12:42:07] oh wth restbase2002 didn't pick up the jbod change at the time of reimage [12:44:15] (03PS1) 10Ema: varnish-slowreqs: reduce program length [puppet] - 10https://gerrit.wikimedia.org/r/391199 [12:44:45] (03CR) 10jerkins-bot: [V: 04-1] varnish-slowreqs: reduce program length [puppet] - 10https://gerrit.wikimedia.org/r/391199 (owner: 10Ema) [12:45:07] (03PS3) 10Jcrespo: WIP: setup s8 and db-common.php refactoring [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391198 (https://phabricator.wikimedia.org/T177208) [12:45:17] (03Abandoned) 10WMDE-leszek: Load DataTypes extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391029 (https://phabricator.wikimedia.org/T180062) (owner: 10WMDE-leszek) [12:45:42] bah I'll ran the reimage again [12:45:49] (03PS2) 10Ema: varnish-slowreqs: reduce programname length [puppet] - 10https://gerrit.wikimedia.org/r/391199 [12:46:27] (03CR) 10jerkins-bot: [V: 04-1] WIP: setup s8 and db-common.php refactoring [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391198 (https://phabricator.wikimedia.org/T177208) (owner: 10Jcrespo) [12:47:50] 10Operations, 10Trending-Service, 10Reading-Infrastructure-Team-Backlog 
(Kanban), 10Services (designing): Turn off Trending Service - https://phabricator.wikimedia.org/T180384#3758612 (10mobrovac) >>! In T180384#3758380, @Joe wrote: >>>! In T180384#3756999, @Krenair wrote: >>>>! In T180384#3756237, @Fjalap... [12:51:05] (03PS4) 10Jcrespo: WIP: setup s8 and db-common.php refactoring [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391198 (https://phabricator.wikimedia.org/T177208) [12:55:01] (03PS5) 10Jcrespo: WIP: setup s8 and db-common.php refactoring [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391198 (https://phabricator.wikimedia.org/T177208) [12:56:01] (03PS2) 10Ema: vcl: distinguish between hfp and hfm [puppet] - 10https://gerrit.wikimedia.org/r/391171 (https://phabricator.wikimedia.org/T180434) [12:57:37] !log upgrade grafana to 4.6.1 on https://grafana-labs.wikimedia.org/ - T180428 [12:57:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:57:44] T180428: Upgrade to latest Grafana 4.6 - https://phabricator.wikimedia.org/T180428 [12:58:44] phedenskog: ^ if you want to test grafana 4.6 on grafana-labs [13:03:26] (03PS1) 10Ema: cache_misc: disable varnish_be<->varnish_be max_connections [puppet] - 10https://gerrit.wikimedia.org/r/391204 [13:10:53] !log shutdown install1002 for disk resize [13:10:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:19] (03CR) 10Marostegui: WIP: setup s8 and db-common.php refactoring (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391198 (https://phabricator.wikimedia.org/T177208) (owner: 10Jcrespo) [13:11:35] !log installing openssl updates [13:11:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:03] PROBLEM - Host install1002 is DOWN: PING CRITICAL - Packet loss = 100% [13:15:12] RECOVERY - Host install1002 is UP: PING OK - Packet loss = 0%, RTA = 0.34 ms [13:17:53] (03PS1) 10Marostegui: db-eqiad.php: Increase traffic for db1105 [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/391206 (https://phabricator.wikimedia.org/T178359) [13:18:02] and /dev/vda1 177G 65G 104G 39% / [13:18:09] _joe_: moritzm ^ [13:18:10] :-) [13:18:11] akosiaris: <3 [13:18:22] PROBLEM - DPKG on labsdb1009 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [13:20:37] nice, thanks [13:22:01] godog: thanks! tested and added a annotation on the fly in the gui, super useful! [13:24:36] (03CR) 10Ema: [C: 032] cache_misc: disable varnish_be<->varnish_be max_connections [puppet] - 10https://gerrit.wikimedia.org/r/391204 (owner: 10Ema) [13:27:56] (03PS1) 10Muehlenhoff: Remove further package leftovers after jessie->stretch upgrades [puppet] - 10https://gerrit.wikimedia.org/r/391207 [13:28:51] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Increase traffic for db1105 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391206 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui) [13:31:40] (03Merged) 10jenkins-bot: db-eqiad.php: Increase traffic for db1105 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391206 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui) [13:31:53] (03CR) 10jenkins-bot: db-eqiad.php: Increase traffic for db1105 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391206 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui) [13:40:59] akosiaris: thanks! 
[13:45:03] PROBLEM - HHVM jobrunner on mw1301 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 9.579 second response time [13:45:12] PROBLEM - Nginx local proxy to apache on mw1301 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 1.390 second response time [13:45:53] RECOVERY - HHVM jobrunner on mw1301 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.007 second response time [13:46:09] 10Operations, 10Ops-Access-Requests, 10Patch-For-Review, 10Reading-Infrastructure-Team-Backlog (Kanban): All Reading Infrastructure engineers should have deploy rights for all services Readers engineering maintains - https://phabricator.wikimedia.org/T180366#3758808 (10Mholloway) I think that would be @dr0... [13:46:12] RECOVERY - Nginx local proxy to apache on mw1301 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.006 second response time [13:54:00] (03CR) 10Jcrespo: [C: 04-1] "The idea of this is perfect, but the commit will be totally updated after a month (new usages will be lost, potentially making it a breaki" [puppet] - 10https://gerrit.wikimedia.org/r/383519 (owner: 10Giuseppe Lavagetto) [13:54:06] phedenskog: awesome, I'll upgrade production grafana tomorrow then [13:54:21] (03CR) 10Jcrespo: [C: 04-1] "s/updated/outdated/" [puppet] - 10https://gerrit.wikimedia.org/r/383519 (owner: 10Giuseppe Lavagetto) [13:55:58] (03PS7) 10Jcrespo: Refactor evenlogging database hosts definition [puppet] - 10https://gerrit.wikimedia.org/r/391182 (https://phabricator.wikimedia.org/T177405) [13:58:18] 10Operations, 10monitoring, 10Graphite, 10Performance-Team (Radar): Upgrade to latest Grafana 4.6 - https://phabricator.wikimedia.org/T180428#3758821 (10Volans) @fgiunchedi yes I'm worried about scalability in terms of transactions write rate (if the automated tools will add many annotations in a short per... [13:59:08] zeljkof: nothing to deploy today :) [13:59:24] zeljkof: hi! 
just adding a patch for swat :) [14:00:05] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: How many deployers does it take to do European Mid-day SWAT(Max 8 patches) deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171114T1400). [14:00:05] No GERRIT patches in the queue for this window AFAICS. [14:00:19] dun dun dun [14:01:57] (03PS1) 10Marostegui: db-eqiad.php: Increase traffic for db1105 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391209 (https://phabricator.wikimedia.org/T178359) [14:02:10] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika ^ hi! just added a small CentralNotice patch... [14:02:12] o/ moritzm [14:02:23] I'm ready for the DB restart we discussed when you are. [14:03:10] (03PS8) 10Jcrespo: Refactor evenlogging database hosts definition [puppet] - 10https://gerrit.wikimedia.org/r/391182 (https://phabricator.wikimedia.org/T177405) [14:03:40] AndyRussG: I can SWAT today, if nobody else is already doing it? cc hashar addshore [14:03:51] * addshore is busy! [14:03:52] zeljkof: thanks so much!!!!!! [14:03:57] really appreciate it :) [14:04:04] ok, for the record... [14:04:08] I can SWAT today! 
[14:04:11] sorry for the late addition :) [14:04:33] AndyRussG: no problem at all, it's not too late even half way in the window [14:05:15] heheh [14:05:26] cool beans [14:06:02] 10Operations, 10wikidiff2, 10Patch-For-Review, 10User-Addshore, and 3 others: Update and use php-wikidiff2 1.5.1 & MovedParagraphDetectionCutoff in production - https://phabricator.wikimedia.org/T177891#3758875 (10Tobi_WMDE_SW) [14:07:00] !log installing postgres security updates on labsdb1004 [14:07:02] AndyRussG: ok, so maybe a problem [14:07:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:07:14] I am not sure that wmf_deploy is the correct branch [14:09:14] shouldn't it be something like wmf/1.31.0-wmf.7? [14:09:20] AndyRussG: ^ [14:10:16] I see the commit in master branch [14:10:16] https://gerrit.wikimedia.org/r/#/q/Ida5e777d435d157cfcf6ead7728b35b6beb330b2 [14:10:38] and wmf_deploy, but shouldn't you cherry pick it to wmf/1.31.0-wmf.7 branch? [14:10:51] addshore: ^ [14:11:02] *reads up* [14:11:21] zeljkof: CentralNotice is a bit of a special case: https://github.com/wikimedia/mediawiki-tools-release/blob/master/make-wmf-branch/config.json#L192 [14:11:30] it magic bumps ALL OF THE THINGS [14:11:34] in ALL OF THE BRANCHES [14:11:43] thcipriani|afk: ah [14:11:47] magic [14:12:00] so, how to I deploy it? all normal? [14:12:16] pretty much [14:12:24] ok, will try [14:12:29] yep, merge and then fetch inside php-1.31.0-wmf.7 and deploy [14:12:37] Reedy: zeljkof addshore yea we need to change that infact [14:12:50] thcipriani|afk: strange [14:12:54] (but not until after the year-end fundraiser...) [14:13:24] the submodule pointer for CN points to the wmf_deploy branch [14:13:46] currently are no wmf.x branches in the CN repository per se [14:14:36] thcipriani|afk: are you around in the next few minutes, just to check if I am doing it right? 
[14:14:49] a bit nervous since I have never done it [14:14:58] yeah, there's a new scap version out today, so I'm stalking :) [14:15:06] thcipriani|afk: great :) [14:15:13] should be the same as all other backports though [14:15:54] yeah, just doesn't need doing on all branches [14:16:26] thcipriani|afk, Reedy: ok, so this is the first step? [14:16:29] zfilipin@tin:/srv/mediawiki-staging/php-1.31.0-wmf.7/extensions/CentralNotice$ git fetch [14:17:02] eh, just git fetch in /srv/mediawiki-staging/php-1.31.0-wmf.7 and it should pull down the submodule bump [14:17:15] git log -p HEAD..@{u} to ensure the submodule bump is what you pulled down [14:17:28] then: git rebase; git submodule update extensions/CentralNotice [14:18:04] thcipriani|afk: ok, that worked [14:18:53] (03PS2) 10Ottomata: [WIP] Add cergen module [puppet] - 10https://gerrit.wikimedia.org/r/391134 (https://phabricator.wikimedia.org/T166167) [14:19:19] sorry for special snowflake [14:19:21] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Add cergen module [puppet] - 10https://gerrit.wikimedia.org/r/391134 (https://phabricator.wikimedia.org/T166167) (owner: 10Ottomata) [14:19:23] 10Operations, 10ops-eqiad, 10DBA, 10Phabricator: Decommission db1048 (was Move m3 slave to db1059) - https://phabricator.wikimedia.org/T175679#3758962 (10jcrespo) @Cmjohnson sorry for the confusion- indeed **it is ok to put down db1048**. All other conversations were about failover to db1059 to substitute... [14:19:44] 10Operations, 10ops-eqiad, 10DBA, 10Phabricator: Decommission db1048 (was Move m3 slave to db1059) - https://phabricator.wikimedia.org/T175679#3758963 (10jcrespo) a:05jcrespo>03Cmjohnson [14:20:22] thcipriani|afk: thanks, looks like everything is fine so far [14:20:31] AndyRussG: the commit is at mwdebug1002, can you test there? 
[14:21:04] (03PS1) 10Ottomata: Support looking up secrets in different modules [puppet] - 10https://gerrit.wikimedia.org/r/391214 [14:21:34] (03CR) 10jerkins-bot: [V: 04-1] Support looking up secrets in different modules [puppet] - 10https://gerrit.wikimedia.org/r/391214 (owner: 10Ottomata) [14:22:49] (03CR) 10Ottomata: "Somehow I suspect yall will not like this :)" [puppet] - 10https://gerrit.wikimedia.org/r/391214 (owner: 10Ottomata) [14:23:11] zeljkof: yep! one sec :) [14:26:06] zeljkof: all good! :) thanks!!!! [14:26:13] AndyRussG: deploying... [14:27:35] no sure why jouncebot is quiet... [14:27:38] jouncebot: next [14:27:38] In 2 hour(s) and 32 minute(s): Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171114T1700) [14:27:40] 10Operations, 10Ops-Access-Requests, 10Patch-For-Review, 10Reading-Infrastructure-Team-Backlog (Kanban): All Reading Infrastructure engineers should have deploy rights for all services Readers engineering maintains - https://phabricator.wikimedia.org/T180366#3759020 (10ssastry) >>! In T180366#3757723, @mob... [14:27:52] AndyRussG: it's deployed, please check [14:28:01] anything else for EU SWAT today? [14:28:58] AndyRussG: thanks for deploying with #releng! ;) [14:29:03] !log EU SWAT finished [14:29:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:12] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Increase traffic for db1105 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391209 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui) [14:29:20] :) [14:29:58] zeljkof: seems fine! thanks again!!!! 
[14:30:24] thcipriani|afk: not sure if it's related to new scap release, but stashbot did not log the deployment at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:40] zeljkof: thanks, looking at that now [14:31:02] (03Merged) 10jenkins-bot: db-eqiad.php: Increase traffic for db1105 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391209 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui) [14:31:16] (03CR) 10jenkins-bot: db-eqiad.php: Increase traffic for db1105 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391209 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui) [14:32:02] 10Operations, 10Release Pipeline, 10Release-Engineering-Team (Watching / External): Update Debian package for Blubber - https://phabricator.wikimedia.org/T179984#3759041 (10akosiaris) How do you plan on working around `github.com/docker/distribution/reference`. If it's vendoring [1] then might as well do it... [14:33:12] (03CR) 10Zoranzoki21: [C: 031] Disable EducationProgram on cs.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391163 (https://phabricator.wikimedia.org/T180426) (owner: 10Urbanecm) [14:33:33] zeljkof: I just deployed and it didn't !log either [14:33:48] (03CR) 10Alexandros Kosiaris: "What's the point of storing certificates in the private repo ? 
They are anyway meant to be served to clients that can be for all intents a" [puppet] - 10https://gerrit.wikimedia.org/r/391214 (owner: 10Ottomata) [14:34:01] thcipriani|afk: ^ [14:34:04] (03Abandoned) 10BBlack: normalize_path: assert(url) [puppet] - 10https://gerrit.wikimedia.org/r/274087 (https://phabricator.wikimedia.org/T127387) (owner: 10BBlack) [14:34:08] (03Abandoned) 10BBlack: normalize_path: refactor control flow [puppet] - 10https://gerrit.wikimedia.org/r/274088 (https://phabricator.wikimedia.org/T127387) (owner: 10BBlack) [14:34:12] (03Abandoned) 10BBlack: normalize_path: fully parameterize the decoded set [puppet] - 10https://gerrit.wikimedia.org/r/274089 (https://phabricator.wikimedia.org/T127387) (owner: 10BBlack) [14:34:30] marostegui: thanks [14:36:35] I *think* I found the problem... [14:37:13] (03CR) 10Jcrespo: [C: 031] "https://puppet-compiler.wmflabs.org/compiler02/8759/" [puppet] - 10https://gerrit.wikimedia.org/r/391182 (https://phabricator.wikimedia.org/T177405) (owner: 10Jcrespo) [14:37:20] (03PS9) 10Jcrespo: Refactor evenlogging database hosts definition [puppet] - 10https://gerrit.wikimedia.org/r/391182 (https://phabricator.wikimedia.org/T177405) [14:37:23] 10Operations, 10fundraising-tech-ops: Long term storage for frack prometheus data - https://phabricator.wikimedia.org/T175738#3759069 (10fgiunchedi) re: long term storage of data in Prometheus I wanted to expand on it also wrt hardware requirements in {T175364}. See https://phabricator.wikimedia.org/T180105#37... 
[14:37:59] !log installing postgres security updates on maps* [14:38:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:16] (03CR) 10Jcrespo: [C: 032] Refactor evenlogging database hosts definition [puppet] - 10https://gerrit.wikimedia.org/r/391182 (https://phabricator.wikimedia.org/T177405) (owner: 10Jcrespo) [14:39:18] 10Operations, 10Fundraising-Backlog, 10fundraising-tech-ops: Port fundraising stats off Ganglia - https://phabricator.wikimedia.org/T152562#3759092 (10fgiunchedi) [14:39:20] 10Operations, 10fundraising-tech-ops: Long term storage for frack prometheus data - https://phabricator.wikimedia.org/T175738#3759090 (10fgiunchedi) 05Resolved>03Open reopening for visibility re: last comment, @cwdent @Jgreen [14:43:52] (03CR) 10BBlack: [C: 031] varnish-slowreqs: reduce programname length [puppet] - 10https://gerrit.wikimedia.org/r/391199 (owner: 10Ema) [14:44:55] (03PS1) 10BBlack: [WIP] normalize_path: fully normalize MW+RB URL paths [puppet] - 10https://gerrit.wikimedia.org/r/391216 (https://phabricator.wikimedia.org/T127387) [14:45:36] (03PS3) 10Ema: varnish-slowreqs: reduce programname length [puppet] - 10https://gerrit.wikimedia.org/r/391199 [14:45:41] (03CR) 10Ema: [V: 032 C: 032] varnish-slowreqs: reduce programname length [puppet] - 10https://gerrit.wikimedia.org/r/391199 (owner: 10Ema) [14:50:00] (03PS4) 10Muehlenhoff: cumin: use new syntax in aliases [puppet] - 10https://gerrit.wikimedia.org/r/389983 (owner: 10Volans) [14:50:27] (03CR) 10Ema: [C: 031] Remove further package leftovers after jessie->stretch upgrades [puppet] - 10https://gerrit.wikimedia.org/r/391207 (owner: 10Muehlenhoff) [14:50:55] (03PS2) 10BBlack: [WIP] normalize_path: fully normalize MW+RB URL paths [puppet] - 10https://gerrit.wikimedia.org/r/391216 (https://phabricator.wikimedia.org/T127387) [14:51:09] 10Operations, 10Goal, 10Patch-For-Review, 10User-fgiunchedi, 10cloud-services-team (Kanban): Port non-deprecated Diamond collectors to Prometheus - 
https://phabricator.wikimedia.org/T177196#3650139 (10chasemp) pinging myself here to not forget :) [14:52:41] (03PS1) 10Giuseppe Lavagetto: Add some additional unit tests [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/391217 [14:53:07] (03CR) 10jerkins-bot: [V: 04-1] Add some additional unit tests [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/391217 (owner: 10Giuseppe Lavagetto) [14:54:00] <_joe_> meh [14:54:30] (03PS1) 10Marostegui: db-eqiad.php: Fully pool db1105 in s1 and s2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391218 (https://phabricator.wikimedia.org/T178359) [14:55:28] (03PS1) 10Thcipriani: scap: upgrade to 3.7.3-1 [puppet] - 10https://gerrit.wikimedia.org/r/391219 (https://phabricator.wikimedia.org/T127762) [14:55:42] (03PS2) 10Giuseppe Lavagetto: Add some additional unit tests [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/391217 [14:55:48] (03CR) 10Muehlenhoff: [C: 031] "Nice, thanks" [puppet] - 10https://gerrit.wikimedia.org/r/389983 (owner: 10Volans) [14:56:06] (03PS1) 10Zoranzoki21: Added throttle rule for course on Wikipedia at a Medicine Faculty campus. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/391220 (https://phabricator.wikimedia.org/T180441) [14:56:27] (03PS3) 10Ema: vcl: distinguish between hfp and hfm [puppet] - 10https://gerrit.wikimedia.org/r/391171 (https://phabricator.wikimedia.org/T180434) [14:56:32] (03CR) 10Volans: [C: 032] cumin: use new syntax in aliases [puppet] - 10https://gerrit.wikimedia.org/r/389983 (owner: 10Volans) [14:57:02] (03CR) 10Filippo Giunchedi: [C: 032] scap: upgrade to 3.7.3-1 [puppet] - 10https://gerrit.wikimedia.org/r/391219 (https://phabricator.wikimedia.org/T127762) (owner: 10Thcipriani) [14:57:08] (03PS2) 10Filippo Giunchedi: scap: upgrade to 3.7.3-1 [puppet] - 10https://gerrit.wikimedia.org/r/391219 (https://phabricator.wikimedia.org/T127762) (owner: 10Thcipriani) [14:57:53] (03PS4) 10Ema: vcl: distinguish between hfp and hfm [puppet] - 10https://gerrit.wikimedia.org/r/391171 (https://phabricator.wikimedia.org/T180434) [14:57:56] (03CR) 10Giuseppe Lavagetto: [C: 032] Add some additional unit tests [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/391217 (owner: 10Giuseppe Lavagetto) [14:58:01] RECOVERY - DPKG on labsdb1009 is OK: All packages OK [14:58:16] godog: thank you! [14:58:25] thcipriani|afk: np, {{done}} [14:58:55] godog: \o/ could you run puppet on tin and then I can test the logging fix? [14:59:06] (03PS1) 10ArielGlenn: move base::firewall out of dumps/snapshots modules and into roles [puppet] - 10https://gerrit.wikimedia.org/r/391221 [14:59:22] (03PS2) 10Zoranzoki21: Added throttle rule for course on Wikipedia at a Medicine Faculty campus. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/391220 (https://phabricator.wikimedia.org/T180441) [14:59:53] thcipriani|afk: I am on tin already, let me do that [15:00:03] thcipriani|afk: yup, running [15:00:06] :) [15:00:07] doh, timing marostegui [15:00:10] thanks both :) [15:00:26] finished now [15:00:28] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Fully pool db1105 in s1 and s2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391218 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui) [15:00:37] !log Decommissioning Cassandra, restbase1012-c.eqiad.wmnet (T179422) [15:00:39] !log thcipriani@tin testing IRC logging [15:00:44] \o/ [15:00:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:47] T179422: Reshape RESTBase Cassandra clusters - https://phabricator.wikimedia.org/T179422 [15:00:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:03] (03CR) 10Zoranzoki21: [C: 031] Switch submit button from 'save' to 'publish' on dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391046 (owner: 10Jforrester) [15:01:06] godog: looks fixed. thank you for the quick reply, I really appreciate it :) [15:01:12] marostegui: thanks for the assist :) [15:01:28] * godog high fives [15:01:39] thcipriani|afk: yw! 
:) [15:02:20] 10Operations, 10hardware-requests: eqiad: (2) hardware access request for labvirt expansion (labvirt1021 & labvirt1022) - https://phabricator.wikimedia.org/T178937#3759216 (10chasemp) poke @RobH [15:03:41] (03Merged) 10jenkins-bot: db-eqiad.php: Fully pool db1105 in s1 and s2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391218 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui) [15:04:32] 10Operations, 10ops-eqiad, 10DBA, 10hardware-requests, 10Patch-For-Review: Decommission db1050 - https://phabricator.wikimedia.org/T178162#3682923 (10jcrespo) @Cmjohnson This would be on the top of the db decommissioning stack (which obviously, is not that urgent) so we can get rid of the non-useful aler... [15:04:37] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Fully pool db1105 in s1 and s2 - T178359 (duration: 00m 45s) [15:04:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:43] T178359: Support multi-instance on core hosts - https://phabricator.wikimedia.org/T178359 [15:05:44] 10Operations, 10ops-eqiad, 10DBA, 10hardware-requests, 10Patch-For-Review: Decommission db1050 - https://phabricator.wikimedia.org/T178162#3759226 (10Marostegui) And let's make sure we mark that bad disk as broken so it is not re-used somewhere else :-) [15:06:40] (03CR) 10jenkins-bot: db-eqiad.php: Fully pool db1105 in s1 and s2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391218 (https://phabricator.wikimedia.org/T178359) (owner: 10Marostegui) [15:10:24] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): labvirt1015 crashes - https://phabricator.wikimedia.org/T171473#3759260 (10chasemp) still seems up and even responsive! 
[15:11:01] (03PS2) 10Rush: toolserver_legacy: redirect /~nikola/articlesby.php [puppet] - 10https://gerrit.wikimedia.org/r/390883 (https://phabricator.wikimedia.org/T179766) (owner: 10BryanDavis) [15:11:16] (03PS2) 10ArielGlenn: move base::firewall out of dumps/snapshots modules and into roles [puppet] - 10https://gerrit.wikimedia.org/r/391221 [15:11:55] (03CR) 10ArielGlenn: [C: 032] move base::firewall out of dumps/snapshots modules and into roles [puppet] - 10https://gerrit.wikimedia.org/r/391221 (owner: 10ArielGlenn) [15:14:06] (03CR) 10Rush: [C: 032] toolserver_legacy: redirect /~nikola/articlesby.php [puppet] - 10https://gerrit.wikimedia.org/r/390883 (https://phabricator.wikimedia.org/T179766) (owner: 10BryanDavis) [15:14:10] (03PS3) 10Rush: toolserver_legacy: redirect /~nikola/articlesby.php [puppet] - 10https://gerrit.wikimedia.org/r/390883 (https://phabricator.wikimedia.org/T179766) (owner: 10BryanDavis) [15:16:10] (03PS2) 10Muehlenhoff: Remove further package leftovers after jessie->stretch upgrades [puppet] - 10https://gerrit.wikimedia.org/r/391207 [15:17:20] (03CR) 10Muehlenhoff: [C: 032] Remove further package leftovers after jessie->stretch upgrades [puppet] - 10https://gerrit.wikimedia.org/r/391207 (owner: 10Muehlenhoff) [15:17:49] PROBLEM - DPKG on labsdb1011 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [15:19:40] (03PS1) 10ArielGlenn: move hardcoded paths out of dump rsync server manifests [puppet] - 10https://gerrit.wikimedia.org/r/391222 (https://phabricator.wikimedia.org/T171541) [15:20:06] (03CR) 10jerkins-bot: [V: 04-1] move hardcoded paths out of dump rsync server manifests [puppet] - 10https://gerrit.wikimedia.org/r/391222 (https://phabricator.wikimedia.org/T171541) (owner: 10ArielGlenn) [15:21:12] (03PS1) 10Muehlenhoff: Update Cumin alias for maps-test [puppet] - 10https://gerrit.wikimedia.org/r/391223 [15:21:31] Hi [15:21:42] This patch https://gerrit.wikimedia.org/r/#/c/391079/ to I add in deployments list or? 
[15:22:05] (03CR) 10Muehlenhoff: [C: 032] Update Cumin alias for maps-test [puppet] - 10https://gerrit.wikimedia.org/r/391223 (owner: 10Muehlenhoff) [15:27:27] (03PS2) 10ArielGlenn: move hardcoded paths out of dump rsync server manifests [puppet] - 10https://gerrit.wikimedia.org/r/391222 (https://phabricator.wikimedia.org/T171541) [15:28:36] 10Puppet, 10Cloud-VPS, 10cloud-services-team (Kanban): role::puppetmaster::standalone has no firewall rule for port 8140 - https://phabricator.wikimedia.org/T154150#3759311 (10bd808) [15:31:48] PROBLEM - Unmerged changes on repository puppet on puppetmaster1001 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). [15:33:48] RECOVERY - Unmerged changes on repository puppet on puppetmaster1001 is OK: No changes to merge. [15:34:40] (03PS1) 10Filippo Giunchedi: add prometheus to k8s_infrastructure_users [labs/private] - 10https://gerrit.wikimedia.org/r/391224 [15:36:48] (03CR) 10Filippo Giunchedi: [V: 032 C: 032] add prometheus to k8s_infrastructure_users [labs/private] - 10https://gerrit.wikimedia.org/r/391224 (owner: 10Filippo Giunchedi) [15:38:49] (03PS3) 10ArielGlenn: move hardcoded paths out of dump rsync server manifests [puppet] - 10https://gerrit.wikimedia.org/r/391222 (https://phabricator.wikimedia.org/T171541) [15:40:09] (03PS5) 10Ema: vcl: distinguish between hfp and hfm [puppet] - 10https://gerrit.wikimedia.org/r/391171 (https://phabricator.wikimedia.org/T180434) [15:41:33] (03PS3) 10BBlack: [WIP] normalize_path: fully normalize MW+RB URL paths [puppet] - 10https://gerrit.wikimedia.org/r/391216 (https://phabricator.wikimedia.org/T127387) [15:51:36] (03PS4) 10ArielGlenn: move hardcoded paths out of dump rsync server manifests [puppet] - 10https://gerrit.wikimedia.org/r/391222 (https://phabricator.wikimedia.org/T171541) [15:52:19] (03CR) 10ArielGlenn: [C: 032] move hardcoded paths out of dump rsync server manifests [puppet] - 
10https://gerrit.wikimedia.org/r/391222 (https://phabricator.wikimedia.org/T171541) (owner: 10ArielGlenn) [15:52:22] (03CR) 10Ema: "pcc output looks sane to me: https://puppet-compiler.wmflabs.org/compiler03/8768/" [puppet] - 10https://gerrit.wikimedia.org/r/391171 (https://phabricator.wikimedia.org/T180434) (owner: 10Ema) [16:00:07] (03CR) 10Ottomata: "Mainly because it will be easier to automate. I talked to _joe_ about this too. Ideally ya, they'd be in ops/puppet. But, I think we'd " [puppet] - 10https://gerrit.wikimedia.org/r/391214 (owner: 10Ottomata) [16:01:53] (03PS1) 10ArielGlenn: clean up the dumps rsyncer profiles [puppet] - 10https://gerrit.wikimedia.org/r/391230 [16:02:10] (03PS4) 10BBlack: normalize_path: fully normalize MW+RB URL paths [puppet] - 10https://gerrit.wikimedia.org/r/391216 (https://phabricator.wikimedia.org/T127387) [16:03:05] 10Operations, 10wikidiff2, 10Patch-For-Review, 10User-Addshore, and 2 others: Update and use php-wikidiff2 1.5.1 & MovedParagraphDetectionCutoff in production - https://phabricator.wikimedia.org/T177891#3759422 (10Tobi_WMDE_SW) [16:04:19] (03PS5) 10BBlack: normalize_path: fully normalize MW+RB URL paths [puppet] - 10https://gerrit.wikimedia.org/r/391216 (https://phabricator.wikimedia.org/T127387) [16:05:05] 10Operations, 10wikidiff2, 10Patch-For-Review, 10User-Addshore, and 2 others: Update and use php-wikidiff2 1.5.1 & MovedParagraphDetectionCutoff in production - https://phabricator.wikimedia.org/T177891#3674094 (10Tobi_WMDE_SW) Before we continue to roll it further out we need to tackle T179277, T166882 an... [16:06:07] Request from 2a02:2149:8648:fc00:98d:47c4:a7b8:4c91 via cp3007 cp3007, Varnish XID 4472375 [16:06:07] Error: 503, Backend fetch failed at Tue, 14 Nov 2017 16:05:37 GMT jenkins console output from puppet compiler, twice in 30 seconds [16:06:24] apergos sometimes that happends. 
[16:12:08] (03PS2) 10ArielGlenn: clean up the dumps rsyncer profiles [puppet] - 10https://gerrit.wikimedia.org/r/391230 [16:15:29] I'm unused to seeing it back to back [16:15:38] and just now again [16:16:08] apergos: looking [16:16:46] reload "fixes" it of course but... just in case it's something [16:18:12] (03CR) 10ArielGlenn: [C: 032] clean up the dumps rsyncer profiles [puppet] - 10https://gerrit.wikimedia.org/r/391230 (owner: 10ArielGlenn) [16:21:01] (03CR) 10BBlack: vcl: distinguish between hfp and hfm (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/391171 (https://phabricator.wikimedia.org/T180434) (owner: 10Ema) [16:21:37] 10Operations, 10DBA, 10Patch-For-Review: Decommission old coredb machines (<=db1050) - https://phabricator.wikimedia.org/T134476#3759532 (10Cmjohnson) [16:21:45] 10Operations, 10ops-eqiad, 10DBA, 10hardware-requests: Decommission db1033 and db1028 - https://phabricator.wikimedia.org/T174076#3759530 (10Cmjohnson) 05Open>03Resolved Wiped, removed from racktables [16:23:10] !log stop labsdb1010 mariadb to clone it later to labsdb1009 T179244 [16:23:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:23:16] T179244: labsdb1009 crashed - OOM - https://phabricator.wikimedia.org/T179244 [16:23:35] (03PS1) 10Ema: Revert "cache_misc: disable varnish_be<->varnish_be max_connections" [puppet] - 10https://gerrit.wikimedia.org/r/391232 [16:24:01] (03PS2) 10Dzahn: [Planet Wikimedia] Remove EndPoint from English planet feeds [puppet] - 10https://gerrit.wikimedia.org/r/390873 (owner: 10Nemo bis) [16:24:03] 10Operations, 10ops-eqiad, 10DBA, 10hardware-requests: Decommission db1035 - https://phabricator.wikimedia.org/T176931#3759565 (10Cmjohnson) 05Open>03Resolved wiped, racktables updated. 
[16:24:23] 10Operations, 10DBA, 10Patch-For-Review: Decomissions old s2 eqiad hosts (db1018, db1021, db1024, db1036) - https://phabricator.wikimedia.org/T162699#3759570 (10Cmjohnson) [16:24:24] (03PS2) 10Ema: Revert "cache_misc: disable varnish_be<->varnish_be max_connections" [puppet] - 10https://gerrit.wikimedia.org/r/391232 [16:24:26] 10Operations, 10ops-eqiad, 10DBA, 10hardware-requests, 10Patch-For-Review: decommission db1036 - https://phabricator.wikimedia.org/T176311#3759568 (10Cmjohnson) 05Open>03Resolved wiped, racktables updated [16:24:42] (03CR) 10Ema: [V: 032 C: 032] Revert "cache_misc: disable varnish_be<->varnish_be max_connections" [puppet] - 10https://gerrit.wikimedia.org/r/391232 (owner: 10Ema) [16:24:46] 10Operations, 10ops-eqiad, 10DBA, 10hardware-requests, 10Patch-For-Review: Decommission db1037 - https://phabricator.wikimedia.org/T174902#3759573 (10Cmjohnson) 05Open>03Resolved wiped, racktables updted [16:24:47] 10Operations, 10DBA, 10Patch-For-Review: Decommission old coredb machines (<=db1050) - https://phabricator.wikimedia.org/T134476#2266773 (10Cmjohnson) [16:25:10] 10Operations, 10DBA, 10Patch-For-Review: Decommission old coredb machines (<=db1050) - https://phabricator.wikimedia.org/T134476#2266774 (10Cmjohnson) [16:25:13] 10Operations, 10ops-eqiad, 10DBA, 10hardware-requests, 10Patch-For-Review: Decommission db1038 - https://phabricator.wikimedia.org/T177911#3759576 (10Cmjohnson) 05Open>03Resolved Wiped, racktables updated [16:25:34] 10Operations, 10DBA, 10Patch-For-Review: Decommission old coredb machines (<=db1050) - https://phabricator.wikimedia.org/T134476#2266776 (10Cmjohnson) [16:25:37] 10Operations, 10ops-eqiad, 10DBA, 10hardware-requests, 10Patch-For-Review: Decommission db1041 - https://phabricator.wikimedia.org/T173915#3759581 (10Cmjohnson) 05Open>03Resolved Wiped, racktables updated [16:26:18] PROBLEM - haproxy failover on dbproxy1011 is CRITICAL: CRITICAL check_failover servers up 1 down 1 [16:26:29] PROBLEM - haproxy 
failover on dbproxy1010 is CRITICAL: CRITICAL check_failover servers up 1 down 1 [16:26:49] ACKNOWLEDGEMENT - haproxy failover on dbproxy1010 is CRITICAL: CRITICAL check_failover servers up 1 down 1 Jcrespo labsdb1010 down for maintenance [16:26:49] ACKNOWLEDGEMENT - haproxy failover on dbproxy1011 is CRITICAL: CRITICAL check_failover servers up 1 down 1 Jcrespo labsdb1010 down for maintenance [16:28:08] (03PS3) 10Dzahn: [Planet Wikimedia] Remove EndPoint from English planet feeds [puppet] - 10https://gerrit.wikimedia.org/r/390873 (owner: 10Nemo bis) [16:29:21] (03CR) 10Dzahn: [C: 032] [Planet Wikimedia] Remove EndPoint from English planet feeds [puppet] - 10https://gerrit.wikimedia.org/r/390873 (owner: 10Nemo bis) [16:31:32] (03PS1) 10Ema: cache: set fe->be and be->be max_connections to 50k [puppet] - 10https://gerrit.wikimedia.org/r/391233 [16:32:04] (03CR) 10BBlack: [C: 031] cache: set fe->be and be->be max_connections to 50k [puppet] - 10https://gerrit.wikimedia.org/r/391233 (owner: 10Ema) [16:32:23] !log Restarting apache2 on phab1001 (deploy phabricator hotfix: D876 ) [16:32:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:32:30] D876: Change call to withFilterPHIDs after method rename - https://phabricator.wikimedia.org/D876 [16:32:37] (03CR) 10Ema: [C: 032] cache: set fe->be and be->be max_connections to 50k [puppet] - 10https://gerrit.wikimedia.org/r/391233 (owner: 10Ema) [16:34:13] (03PS1) 10Cmjohnson: Removing site.pp and dhcp entries for decom host mw1161-69 T177387 [puppet] - 10https://gerrit.wikimedia.org/r/391234 [16:35:33] Amir1, apergos: the issues affecting cp3007 should be fixed now (https://gerrit.wikimedia.org/r/#/c/391233/). Please let #wikimedia-traffic know if that's not the case! [16:37:09] PROBLEM - puppetmaster backend https on puppetmaster2002 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 8141: HTTP/1.1 500 Internal Server Error [16:37:44] herron: is that you? 
^^^ [16:37:59] 10Operations, 10Ops-Access-Requests, 10Discovery, 10Wikidata, and 3 others: Allow Kirk and Martijn (JClarity) access to our WDQS production servers - https://phabricator.wikimedia.org/T178271#3759651 (10RStallman-legalteam) @MoritzMuehlenhoff @EBjune - Kirk has signed as well now, so the NDAs are completed... [16:38:20] volans yes sort of, the check needs to be updated for puppet 4. downtime must have just expired [16:38:26] ok [16:38:29] interesting! thanks, I'll keep it in mind [16:38:53] 10Operations, 10ops-eqiad, 10Patch-For-Review, 10User-Elukey, 10User-Joe: Decomission mw1161-69 - https://phabricator.wikimedia.org/T177387#3759656 (10Cmjohnson) @elukey or @Joe I went to finish the decom and found that 2 host still show up in puppet. please give me the okay to proceed. modules/servi... [16:41:37] (03CR) 10Jforrester: [C: 031] "This has my "blessing" from a Beta Features POV. :-)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/390386 (https://phabricator.wikimedia.org/T180147) (owner: 10Addshore) [16:41:55] (03PS1) 10Volans: Icinga: allow to set display_name [puppet] - 10https://gerrit.wikimedia.org/r/391235 (https://phabricator.wikimedia.org/T170353) [16:41:56] (03PS1) 10Volans: Metric alarms: add link to the Grafana dashboard [puppet] - 10https://gerrit.wikimedia.org/r/391236 (https://phabricator.wikimedia.org/T170353) [16:41:58] (03PS1) 10Volans: Icinga notification: use display_name in messages [puppet] - 10https://gerrit.wikimedia.org/r/391237 (https://phabricator.wikimedia.org/T170353) [16:42:01] (03PS1) 10Volans: Metric alarms: make link to Grafana mandatory [puppet] - 10https://gerrit.wikimedia.org/r/391238 (https://phabricator.wikimedia.org/T170353) [16:42:07] let's see how much jenkins hates me...
[16:42:29] (03CR) 10jerkins-bot: [V: 04-1] Icinga: allow to set display_name [puppet] - 10https://gerrit.wikimedia.org/r/391235 (https://phabricator.wikimedia.org/T170353) (owner: 10Volans) [16:44:14] 10Operations, 10Wikimedia-Planet, 10Patch-For-Review: upgrade planet instances to stretch - https://phabricator.wikimedia.org/T168490#3759699 (10Paladox) [16:46:43] (03PS2) 10Volans: Icinga: allow to set display_name [puppet] - 10https://gerrit.wikimedia.org/r/391235 (https://phabricator.wikimedia.org/T170353) [16:46:45] (03PS2) 10Volans: Metric alarms: add link to the Grafana dashboard [puppet] - 10https://gerrit.wikimedia.org/r/391236 (https://phabricator.wikimedia.org/T170353) [16:46:47] (03PS2) 10Volans: Icinga notification: use display_name in messages [puppet] - 10https://gerrit.wikimedia.org/r/391237 (https://phabricator.wikimedia.org/T170353) [16:46:49] (03PS2) 10Volans: Metric alarms: make link to Grafana mandatory [puppet] - 10https://gerrit.wikimedia.org/r/391238 (https://phabricator.wikimedia.org/T170353) [16:46:52] (03PS1) 10Filippo Giunchedi: standard: map an host to its production cluster [puppet] - 10https://gerrit.wikimedia.org/r/391240 (https://phabricator.wikimedia.org/T180256) [16:46:54] (03PS1) 10Filippo Giunchedi: wmflib: switch to Standard::Cluster to get cluster mappings [puppet] - 10https://gerrit.wikimedia.org/r/391241 (https://phabricator.wikimedia.org/T180256) [16:47:07] * godog standing by for jenkins' -1 [16:47:20] (03PS1) 10ArielGlenn: add dumps rsync user and path variables to hiera [puppet] - 10https://gerrit.wikimedia.org/r/391242 (https://phabricator.wikimedia.org/T171541) [16:47:25] godog: me too :D [16:47:32] (03CR) 10jerkins-bot: [V: 04-1] standard: map an host to its production cluster [puppet] - 10https://gerrit.wikimedia.org/r/391240 (https://phabricator.wikimedia.org/T180256) (owner: 10Filippo Giunchedi) [16:47:42] heheh volans QED [16:47:47] yeah [16:48:25] (03PS2) 10Hoo man: Include php5 packages on canary hosts [puppet] - 
10https://gerrit.wikimedia.org/r/391045 [16:48:50] (03CR) 10jerkins-bot: [V: 04-1] Include php5 packages on canary hosts [puppet] - 10https://gerrit.wikimedia.org/r/391045 (owner: 10Hoo man) [16:49:26] (03PS2) 10Filippo Giunchedi: standard: map an host to its production cluster [puppet] - 10https://gerrit.wikimedia.org/r/391240 (https://phabricator.wikimedia.org/T180256) [16:49:28] (03PS2) 10Filippo Giunchedi: wmflib: switch to Standard::Cluster to get cluster mappings [puppet] - 10https://gerrit.wikimedia.org/r/391241 (https://phabricator.wikimedia.org/T180256) [16:49:57] (03CR) 10jerkins-bot: [V: 04-1] standard: map an host to its production cluster [puppet] - 10https://gerrit.wikimedia.org/r/391240 (https://phabricator.wikimedia.org/T180256) (owner: 10Filippo Giunchedi) [16:51:29] (03PS3) 10Filippo Giunchedi: standard: map an host to its production cluster [puppet] - 10https://gerrit.wikimedia.org/r/391240 (https://phabricator.wikimedia.org/T180256) [16:51:31] (03PS3) 10Filippo Giunchedi: wmflib: switch to Standard::Cluster to get cluster mappings [puppet] - 10https://gerrit.wikimedia.org/r/391241 (https://phabricator.wikimedia.org/T180256) [16:52:52] godog: if you just need cluster and site exported [16:53:00] they are already in profile::cumin::target [16:54:00] "exported" in the sense of queryable by puppetdb [16:54:07] not in the sense of exported resources [16:55:00] volans: ah!
interesting, yeah queryable by puppetdb would be enough in this case, I'll give it a try [16:55:45] 10Operations, 10ops-eqiad, 10Patch-For-Review, 10User-Elukey, 10User-Joe: Decomission mw1161-69 - https://phabricator.wikimedia.org/T177387#3759770 (10Cmjohnson) [16:55:52] (03CR) 10Volans: "NOOP according to puppet compiler: https://puppet-compiler.wmflabs.org/compiler03/8772/" [puppet] - 10https://gerrit.wikimedia.org/r/391235 (https://phabricator.wikimedia.org/T170353) (owner: 10Volans) [16:57:42] (03PS3) 10Hoo man: Include php5 packages on canary hosts [puppet] - 10https://gerrit.wikimedia.org/r/391045 [16:59:59] 10Operations, 10Wikimedia-Planet: planet.wikimedia.org: replace planet-venus software with rawdog - https://phabricator.wikimedia.org/T180498#3759794 (10Dzahn) [17:00:04] godog, moritzm, and _joe_: How many deployers does it take to do Puppet SWAT(Max 8 patches) deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171114T1700). [17:00:05] Zoranzoki21: A patch you scheduled for Puppet SWAT(Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[17:01:22] That's in the wrong window [17:02:49] indeed [17:02:56] (03CR) 10Volans: "Compiler results: https://puppet-compiler.wmflabs.org/compiler02/8773/" [puppet] - 10https://gerrit.wikimedia.org/r/391236 (https://phabricator.wikimedia.org/T170353) (owner: 10Volans) [17:02:57] * godog closes the window [17:03:14] 10Operations, 10Wikimedia-Planet: planet.wikimedia.org: replace planet-venus software with rawdog - https://phabricator.wikimedia.org/T180498#3759814 (10Dzahn) [17:03:28] (03PS4) 10Filippo Giunchedi: wmflib: switch away from Ganglia::Cluster to get cluster mappings [puppet] - 10https://gerrit.wikimedia.org/r/391241 (https://phabricator.wikimedia.org/T180256) [17:03:57] 10Operations, 10Wikimedia-Planet, 10Patch-For-Review: upgrade planet instances to stretch - https://phabricator.wikimedia.org/T168490#3759815 (10Paladox) [17:04:19] 10Operations, 10Wikimedia-Planet: planet.wikimedia.org: replace planet-venus software with rawdog - https://phabricator.wikimedia.org/T180498#3759817 (10Paladox) [17:05:01] (03PS1) 10Addshore: Stop using extension-list-wikidata from Wikidata build for labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391245 (https://phabricator.wikimedia.org/T177060) [17:05:32] there is a change in Puppet SWAT but it doesn't belong there because it's mw-config?
[17:05:38] (03PS1) 10Cmjohnson: Removing dns entries for decom hosts mw1161-69 T177387 [dns] - 10https://gerrit.wikimedia.org/r/391247 [17:05:41] yeah [17:05:51] mutante: see 4 minutes ago ;) [17:05:52] (03CR) 10Cmjohnson: [C: 032] Removing site.pp and dhcp entries for decom host mw1161-69 T177387 [puppet] - 10https://gerrit.wikimedia.org/r/391234 (owner: 10Cmjohnson) [17:05:59] (03PS2) 10Cmjohnson: Removing site.pp and dhcp entries for decom host mw1161-69 T177387 [puppet] - 10https://gerrit.wikimedia.org/r/391234 [17:06:03] (03CR) 10jerkins-bot: [V: 04-1] Stop using extension-list-wikidata from Wikidata build for labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391245 (https://phabricator.wikimedia.org/T177060) (owner: 10Addshore) [17:06:09] (03CR) 10Cmjohnson: [V: 032 C: 032] Removing site.pp and dhcp entries for decom host mw1161-69 T177387 [puppet] - 10https://gerrit.wikimedia.org/r/391234 (owner: 10Cmjohnson) [17:06:28] :) hi and gotcha [17:06:42] (03CR) 10Cmjohnson: [C: 032] Removing dns entries for decom hosts mw1161-69 T177387 [dns] - 10https://gerrit.wikimedia.org/r/391247 (owner: 10Cmjohnson) [17:06:49] (03PS2) 10Addshore: Stop using extension-list-wikidata from Wikidata build for labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391245 (https://phabricator.wikimedia.org/T177060) [17:07:00] 10Operations, 10Wikimedia-Planet: planet.wikimedia.org: replace planet-venus software with rawdog - https://phabricator.wikimedia.org/T180498#3759826 (10Paladox) [17:07:08] 10Operations, 10Wikimedia-Planet: planet.wikimedia.org: replace planet-venus software with rawdog - https://phabricator.wikimedia.org/T180498#3759827 (10Dzahn) a:05Dzahn>03None [17:07:29] hi! 
btw mutante I think I figured out what was wrong with T180256 [17:07:29] T180256: authdns prometheus metrics are not available anymore - https://phabricator.wikimedia.org/T180256 [17:08:06] (03PS2) 10Cmjohnson: Removing dns entries for decom hosts mw1161-69 T177387 [dns] - 10https://gerrit.wikimedia.org/r/391247 [17:08:08] (03PS5) 10Filippo Giunchedi: wmflib: switch away from Ganglia::Cluster to get cluster mappings [puppet] - 10https://gerrit.wikimedia.org/r/391241 (https://phabricator.wikimedia.org/T180256) [17:08:10] (03PS4) 10Filippo Giunchedi: standard: map an host to its production cluster [puppet] - 10https://gerrit.wikimedia.org/r/391240 (https://phabricator.wikimedia.org/T180256) [17:09:00] (03CR) 10Cmjohnson: [C: 032] Removing dns entries for decom hosts mw1161-69 T177387 [dns] - 10https://gerrit.wikimedia.org/r/391247 (owner: 10Cmjohnson) [17:09:05] (03CR) 10Addshore: [C: 032] Stop using extension-list-wikidata from Wikidata build for labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391245 (https://phabricator.wikimedia.org/T177060) (owner: 10Addshore) [17:10:18] (03Merged) 10jenkins-bot: Stop using extension-list-wikidata from Wikidata build for labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391245 (https://phabricator.wikimedia.org/T177060) (owner: 10Addshore) [17:10:25] (03CR) 10jenkins-bot: Stop using extension-list-wikidata from Wikidata build for labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391245 (https://phabricator.wikimedia.org/T177060) (owner: 10Addshore) [17:10:37] 10Operations, 10Wikimedia-Planet: planet.wikimedia.org: replace planet-venus software with rawdog - https://phabricator.wikimedia.org/T180498#3759850 (10Dzahn) [17:10:56] !log demon@tin Pruned MediaWiki: 1.31.0-wmf.6 [keeping static files] (duration: 02m 48s) [17:11:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:11:58] !log demon@tin Synchronized wmf-config/extension-list-labs: No-op (duration: 00m 44s) 
[17:12:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:12:07] godog: ooh, i had not seen that.. ! that's what i needed for appservers too.. reading [17:12:54] !log demon@tin Synchronized wmf-config/CommonSettings.php: Beta-only, no-op (duration: 00m 43s) [17:12:55] cumin targets.. aha [17:12:57] (03PS2) 10ArielGlenn: add dumps rsync user and path variables to hiera [puppet] - 10https://gerrit.wikimedia.org/r/391242 (https://phabricator.wikimedia.org/T171541) [17:12:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:15:59] (03PS1) 10Addshore: Stop using extension-list-wikidata from Wikidat build for prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391251 (https://phabricator.wikimedia.org/T177060) [17:19:08] volans: puppetdb is unhappy now afaics but yeah then https://gerrit.wikimedia.org/r/#/c/391241/ alone is what we would need I think [17:19:31] unhappy how? :D [17:19:46] https://puppet-compiler.wmflabs.org/compiler02/8776/prometheus1003.eqiad.wmnet/prod.prometheus1003.eqiad.wmnet.err [17:20:05] do we have a puppetdb for the compiler? 
[17:20:21] yeah, but that's a rabbithole for tomorrow [17:21:06] I'm not sure if it ever worked [17:21:11] (03CR) 10Dzahn: [C: 031] "this would be great as a fix for T180256 and also make it possible to merge similar changes to decom Ganglia that affect appservers" [puppet] - 10https://gerrit.wikimedia.org/r/391240 (https://phabricator.wikimedia.org/T180256) (owner: 10Filippo Giunchedi) [17:22:47] I think it did in the past, I remember testing prometheus changes [17:23:40] good to know :) [17:24:08] (03PS2) 10Ayounsi: [WIP] Have every rdns advertise a private anycast VIP [puppet] - 10https://gerrit.wikimedia.org/r/391149 [17:25:32] RECOVERY - DPKG on labsdb1011 is OK: All packages OK [17:27:32] (03PS15) 10Elukey: First commit [software/druid_exporter] - 10https://gerrit.wikimedia.org/r/389475 (https://phabricator.wikimedia.org/T177459) [17:27:48] !log demon@tin Pruned MediaWiki: 1.31.0-wmf.3 (duration: 07m 58s) [17:27:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:28:43] 10Operations, 10Wikimedia-Planet, 10Patch-For-Review: upgrade planet instances to stretch - https://phabricator.wikimedia.org/T168490#3759987 (10Dzahn) [17:29:25] 10Operations, 10Wikimedia-Planet, 10Patch-For-Review: upgrade planet instances to stretch - https://phabricator.wikimedia.org/T168490#3366199 (10Dzahn) [17:29:28] 10Operations, 10Trending-Service, 10Reading-Infrastructure-Team-Backlog (Kanban), 10Services (designing): Turn off Trending Service - https://phabricator.wikimedia.org/T180384#3755911 (10Ottomata) Others have also asked for a limited Kafka Mirror available in Cloud VPS somewhere. I'm not opposed, we'd jus... 
[17:35:19] (03PS3) 10ArielGlenn: add dumps rsync user and path variables to hiera [puppet] - 10https://gerrit.wikimedia.org/r/391242 (https://phabricator.wikimedia.org/T171541) [17:35:51] (03CR) 10jerkins-bot: [V: 04-1] add dumps rsync user and path variables to hiera [puppet] - 10https://gerrit.wikimedia.org/r/391242 (https://phabricator.wikimedia.org/T171541) (owner: 10ArielGlenn) [17:35:53] (03CR) 10Ottomata: "I think ideally, the secret() function would work more like the template() function when looking up file paths, with the module name requi" [puppet] - 10https://gerrit.wikimedia.org/r/391214 (owner: 10Ottomata) [17:36:51] (03PS3) 10Volans: Metric alarms: add link to the Grafana dashboard [puppet] - 10https://gerrit.wikimedia.org/r/391236 (https://phabricator.wikimedia.org/T170353) [17:36:54] (03PS3) 10Volans: Icinga notification: use display_name in messages [puppet] - 10https://gerrit.wikimedia.org/r/391237 (https://phabricator.wikimedia.org/T170353) [17:36:55] (03PS3) 10Volans: Metric alarms: make link to Grafana mandatory [puppet] - 10https://gerrit.wikimedia.org/r/391238 (https://phabricator.wikimedia.org/T170353) [17:37:41] (03PS4) 10ArielGlenn: add dumps rsync user and path variables to hiera [puppet] - 10https://gerrit.wikimedia.org/r/391242 (https://phabricator.wikimedia.org/T171541) [17:38:02] (03CR) 10Volans: "Puppet compiler: https://puppet-compiler.wmflabs.org/compiler02/8774/" [puppet] - 10https://gerrit.wikimedia.org/r/391237 (https://phabricator.wikimedia.org/T170353) (owner: 10Volans) [17:42:03] !log smalyshev@tin Started deploy [wdqs/wdqs@b44cf27]: data reload/T176593 [17:42:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:42:19] !log smalyshev@tin Finished deploy [wdqs/wdqs@b44cf27]: data reload/T176593 (duration: 00m 16s) [17:42:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:43:40] (03PS5) 10ArielGlenn: add dumps rsync user and path variables to hiera [puppet] - 
10https://gerrit.wikimedia.org/r/391242 (https://phabricator.wikimedia.org/T171541) [17:44:15] !log restarted puppetdb on nitrogen and nihal to pick up jre updates [17:44:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:45:37] PROBLEM - puppet last run on elastic1018 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:45:56] PROBLEM - puppet last run on labsdb1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:46:57] PROBLEM - puppet last run on mw2158 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:47:11] (03PS1) 10Chad: group0 to wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391258 [17:47:13] (03CR) 10Chad: [C: 04-2] group0 to wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391258 (owner: 10Chad) [17:47:26] PROBLEM - puppet last run on aqs1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:48:07] PROBLEM - puppet last run on mc2030 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:48:13] ^ those puppet alerts are likely related to puppetdb restart and should clear on next run [17:48:57] PROBLEM - puppet last run on ms-be1029 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:49:26] PROBLEM - puppet last run on mc2021 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:49:26] PROBLEM - puppet last run on analytics1052 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:49:47] PROBLEM - puppet last run on analytics1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:49:56] PROBLEM - puppet last run on darmstadtium is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [17:52:46] (03PS6) 10ArielGlenn: add dumps rsync user and path variables to hiera [puppet] - 10https://gerrit.wikimedia.org/r/391242 (https://phabricator.wikimedia.org/T171541) [17:53:54] (03CR) 10ArielGlenn: [C: 032] add dumps rsync user and path variables to hiera [puppet] - 10https://gerrit.wikimedia.org/r/391242 (https://phabricator.wikimedia.org/T171541) (owner: 10ArielGlenn) [17:56:05] !log ebernhardson@tin Started deploy [search/MjoLniR/deploy@607adfb]: (no justification provided) [17:56:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:56:19] !log ebernhardson@tin Finished deploy [search/MjoLniR/deploy@607adfb]: (no justification provided) (duration: 00m 14s) [17:56:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:57:49] !log ebernhardson@tin Started deploy [search/mjolnir/deploy@ceb5c2f]: (no justification provided) [17:57:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:58:09] !log ebernhardson@tin Finished deploy [search/mjolnir/deploy@ceb5c2f]: (no justification provided) (duration: 00m 20s) [17:58:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:58:43] (03PS4) 10Volans: Metric alarms: make link to Grafana mandatory [puppet] - 10https://gerrit.wikimedia.org/r/391238 (https://phabricator.wikimedia.org/T170353) [17:58:57] RECOVERY - puppet last run on ms-be1029 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [17:59:26] RECOVERY - puppet last run on mc2021 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [17:59:27] RECOVERY - puppet last run on analytics1052 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [18:00:04] gwicke, cscott, arlolra, subbu, halfak, and Amir1: #bothumor Q:Why did functions stop calling each other? A:They had arguments. 
Rise for Services – Graphoid / Parsoid / OCG / Citoid / ORES . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171114T1800). [18:00:04] No GERRIT patches in the queue for this window AFAICS. [18:00:12] (03PS1) 10ArielGlenn: enable dump rsyncs to/from labstore1006 [puppet] - 10https://gerrit.wikimedia.org/r/391263 (https://phabricator.wikimedia.org/T171541) [18:00:25] Nothing for ORES [18:00:57] arlo will be deploy parsoid [18:01:19] (03CR) 10jerkins-bot: [V: 04-1] enable dump rsyncs to/from labstore1006 [puppet] - 10https://gerrit.wikimedia.org/r/391263 (https://phabricator.wikimedia.org/T171541) (owner: 10ArielGlenn) [18:01:50] (03PS2) 10ArielGlenn: enable dump rsyncs to/from labstore1006 [puppet] - 10https://gerrit.wikimedia.org/r/391263 (https://phabricator.wikimedia.org/T171541) [18:02:22] !log mholloway-shell@tin Started deploy [mobileapps/deploy@9b10959]: Redeploying: Update mobileapps to c002862 [18:02:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:04:11] !log ebernhardson@tin Started deploy [search/mjolnir/deploy@cd6ddda]: (no justification provided) [18:04:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:07:47] !log rebooting notebook* hosts for update to 4.9.51 [18:07:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:08:32] (03PS3) 10Ayounsi: [WIP] Have every rdns advertise a private anycast VIP [puppet] - 10https://gerrit.wikimedia.org/r/391149 [18:08:40] !log ebernhardson@tin Finished deploy [search/mjolnir/deploy@cd6ddda]: (no justification provided) (duration: 04m 28s) [18:08:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:09:34] (03CR) 10Volans: "Compiler results: https://puppet-compiler.wmflabs.org/compiler02/8781/" [puppet] - 10https://gerrit.wikimedia.org/r/391238 (https://phabricator.wikimedia.org/T170353) (owner: 10Volans) [18:13:24] !log arlolra@tin Started deploy [parsoid/deploy@b150764]: 
Updating Parsoid to e71937d0 [18:13:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:14:46] 10Operations, 10Ops-Access-Requests, 10Discovery, 10Wikidata, and 3 others: Allow Kirk and Martijn (JClarity) access to our WDQS production servers - https://phabricator.wikimedia.org/T178271#3760163 (10debt) [18:15:42] 10Operations, 10Ops-Access-Requests, 10Discovery, 10Wikidata, and 3 others: Allow Kirk and Martijn (JClarity) access to our WDQS production servers - https://phabricator.wikimedia.org/T178271#3686630 (10debt) @gehel will check with JClarity to see if they still need connections to our servers, otherwise we... [18:15:59] 10Operations, 10Ops-Access-Requests, 10Discovery, 10Wikidata, and 3 others: Allow Kirk and Martijn (JClarity) access to our WDQS production servers - https://phabricator.wikimedia.org/T178271#3760166 (10EBjune) a:05EBjune>03Gehel Passing it to @Gehel [18:18:51] (03CR) 10BBlack: [WIP] Have every rdns advertise a private anycast VIP (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/391149 (owner: 10Ayounsi) [18:19:57] !log ebernhardson@tin Started deploy [search/mjolnir/deploy@e6905f4]: test mjolnir deployment [18:20:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:20:30] (03PS1) 10Addshore: Remove db_slave entry from statistics wmde graphite [puppet] - 10https://gerrit.wikimedia.org/r/391269 (https://phabricator.wikimedia.org/T180025) [18:21:10] !log ebernhardson@tin Finished deploy [search/mjolnir/deploy@e6905f4]: test mjolnir deployment (duration: 01m 13s) [18:21:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:21:41] (03PS3) 10ArielGlenn: enable dump rsyncs to/from labstore1006 [puppet] - 10https://gerrit.wikimedia.org/r/391263 (https://phabricator.wikimedia.org/T171541) [18:22:38] !log arlolra@tin Finished deploy [parsoid/deploy@b150764]: Updating Parsoid to e71937d0 (duration: 09m 13s) [18:22:43] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:26:03] !log Upgraded notebook1001 and 1002 to kernel version 4.9.51-1~bpo8+1 [18:26:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:28:24] !log upgraded nginx on notebook* to 1.13.6 [18:28:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:31:34] !log mholloway-shell@tin Finished deploy [mobileapps/deploy@9b10959]: Redeploying: Update mobileapps to c002862 (duration: 29m 12s) [18:31:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:32:12] !log Updated Parsoid to e71937d0 (T178253) [18:32:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:32:18] T178253: Figure handler rejects nested tables in figure captions - https://phabricator.wikimedia.org/T178253 [18:33:16] PROBLEM - cassandra-c service on restbase2002 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed [18:33:17] PROBLEM - cassandra-c SSL 10.192.16.167:7001 on restbase2002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [18:33:26] PROBLEM - cassandra-b SSL 10.192.16.166:7001 on restbase2002 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [18:33:27] PROBLEM - cassandra-b CQL 10.192.16.166:9042 on restbase2002 is CRITICAL: connect to address 10.192.16.166 and port 9042: Connection refused [18:33:37] PROBLEM - cassandra-c CQL 10.192.16.167:9042 on restbase2002 is CRITICAL: connect to address 10.192.16.167 and port 9042: Connection refused [18:33:37] PROBLEM - cassandra-b service on restbase2002 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed [18:33:56] PROBLEM - Check systemd state on restbase2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[18:34:07] PROBLEM - cassandra-a CQL 10.192.16.165:9042 on restbase2002 is CRITICAL: connect to address 10.192.16.165 and port 9042: Connection refused [18:34:58] (03CR) 10Ottomata: [C: 032] Remove db_slave entry from statistics wmde graphite [puppet] - 10https://gerrit.wikimedia.org/r/391269 (https://phabricator.wikimedia.org/T180025) (owner: 10Addshore) [18:35:24] (03CR) 10ArielGlenn: "Not yet ready, but I have a couple questions:" [puppet] - 10https://gerrit.wikimedia.org/r/391263 (https://phabricator.wikimedia.org/T171541) (owner: 10ArielGlenn) [18:51:59] (03CR) 10Madhuvishy: "On base::firewall, yes we can! We'll need to poke holes later, but that's fine." [puppet] - 10https://gerrit.wikimedia.org/r/391263 (https://phabricator.wikimedia.org/T171541) (owner: 10ArielGlenn) [18:57:16] (03CR) 10Dzahn: puppetdb: Allow customising username and password for active record (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/390504 (owner: 10Paladox) [19:00:47] (03PS1) 10Jcrespo: mariadb: Depool db2034 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391274 [19:01:43] (03CR) 10Chad: puppetdb: Allow customising username and password for active record (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/390504 (owner: 10Paladox) [19:02:39] (03CR) 10Jcrespo: [C: 032] mariadb: Depool db2034 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391274 (owner: 10Jcrespo) [19:02:56] !log demon@tin Started scap: bootstrap wmf.8 [19:02:59] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): labvirt1015 crashes - https://phabricator.wikimedia.org/T171473#3760325 (10bd808) [19:03:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:03:02] rb2002 cass issues are expected ^ [19:03:57] (03Merged) 10jenkins-bot: mariadb: Depool db2034 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391274 (owner: 10Jcrespo) [19:06:30] (03CR) 10jenkins-bot: mariadb: Depool db2034 for 
maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391274 (owner: 10Jcrespo) [19:07:11] (03PS1) 10Jcrespo: Revert "mariadb: Depool db2034 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391275 [19:09:27] !log for i in `OS_TENANT_NAME=testlabs openstack server list | grep stress | awk '{print $2}'`; do echo $i; OS_TENANT_NAME=testlabs openstack server delete $i; sleep 30; done T171473 [19:09:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:09:34] T171473: labvirt1015 crashes - https://phabricator.wikimedia.org/T171473 [19:10:21] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): labvirt1015 crashes - https://phabricator.wikimedia.org/T171473#3760351 (10chasemp) After 4 rounds I am going to purge the stress test instances and see about cycling this into some real world load. [19:10:48] !log restart db2034 for mariadb upgrade [19:10:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:15:27] 10Operations, 10Traffic, 10netops, 10Cloud-VPS (Quota-requests): Request increased quota for traffic Cloud VPS project - https://phabricator.wikimedia.org/T180178#3749534 (10chasemp) We are dealing with a bit of a resource crunch (T161118, T171473, T178937, etc) and need to rebalance and do some rounds of... 
[19:16:09] (03PS6) 10BBlack: normalize_path: fully normalize MW+RB URL paths [puppet] - 10https://gerrit.wikimedia.org/r/391216 (https://phabricator.wikimedia.org/T127387) [19:17:40] (03PS1) 10Dzahn: planet: small style change for rawdog [puppet] - 10https://gerrit.wikimedia.org/r/391276 [19:19:17] RECOVERY - cassandra-a CQL 10.192.16.165:9042 on restbase2002 is OK: TCP OK - 0.036 second response time on 10.192.16.165 port 9042 [19:19:47] 10Operations, 10Traffic, 10netops, 10Cloud-VPS (Quota-requests): Request increased quota for traffic Cloud VPS project - https://phabricator.wikimedia.org/T180178#3749534 (10bd808) +1 once labvirt1015 is online [19:19:51] (03PS2) 10Dzahn: planet: small style change for rawdog [puppet] - 10https://gerrit.wikimedia.org/r/391276 [19:19:57] (03PS3) 10Dzahn: planet: small style change for rawdog [puppet] - 10https://gerrit.wikimedia.org/r/391276 [19:21:04] (03CR) 10Dzahn: [C: 032] planet: small style change for rawdog [puppet] - 10https://gerrit.wikimedia.org/r/391276 (owner: 10Dzahn) [19:23:53] (03CR) 10Jcrespo: [C: 032] Revert "mariadb: Depool db2034 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391275 (owner: 10Jcrespo) [19:25:05] (03Merged) 10jenkins-bot: Revert "mariadb: Depool db2034 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391275 (owner: 10Jcrespo) [19:26:21] (03CR) 10jenkins-bot: Revert "mariadb: Depool db2034 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391275 (owner: 10Jcrespo) [19:26:36] demon: can I ask you to rebase when finished? 
there are 2 noop changes on mediawiki config - https://gerrit.wikimedia.org/r/391275 [19:27:48] make sure tin is on 3c0763a7329a392df3 (at the time of writing this), it is a noop [19:32:24] !log clean up instances in error state in testlabs project [19:32:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:41:23] 10Operations, 10ops-ulsfo, 10Traffic: cp4024 kernel errors - https://phabricator.wikimedia.org/T174891#3760561 (10RobH) Ok, did the following: * pulled cpu1 entirely because I didnt want to waste thermal compund swapping it to cpu 2. * put suspected cpu 2 into cpu 1 socket * installed os, got cpu error duri... [19:44:41] 10Operations, 10ops-ulsfo, 10Traffic: cp4024 kernel errors - https://phabricator.wikimedia.org/T174891#3760567 (10RobH) a:05RobH>03BBlack @BBlack: Assignign this back to you, please reimage or place this system back into service as you see fit. The CPU error hasn't shown back up during the OS install si... [19:47:33] !log demon@tin Finished scap: bootstrap wmf.8 (duration: 44m 37s) [19:47:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:56:20] !log ebernhardson@tin Started deploy [search/mjolnir/deploy@b20f0da]: test mjolnir deployment [19:56:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:57:50] !log Bootstrapping restbase2002-b.codfw.wmnet (T179422) [19:57:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:57:57] T179422: Reshape RESTBase Cassandra clusters - https://phabricator.wikimedia.org/T179422 [19:58:32] !log ebernhardson@tin Finished deploy [search/mjolnir/deploy@b20f0da]: test mjolnir deployment (duration: 02m 12s) [19:58:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:00:05] no_justification: I, the Bot under the Fountain, allow thee, The Deployer, to do MediaWiki train deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20171114T2000). 
[20:00:05] No GERRIT patches in the queue for this window AFAICS. [20:00:46] RECOVERY - cassandra-b SSL 10.192.16.166:7001 on restbase2002 is OK: SSL OK - Certificate restbase2002-b valid until 2018-08-17 16:11:45 +0000 (expires in 275 days) [20:01:03] (03CR) 10Chad: [C: 032] group0 to wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391258 (owner: 10Chad) [20:01:16] RECOVERY - cassandra-b service on restbase2002 is OK: OK - cassandra-b is active [20:02:20] (03Merged) 10jenkins-bot: group0 to wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391258 (owner: 10Chad) [20:02:33] (03CR) 10jenkins-bot: group0 to wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391258 (owner: 10Chad) [20:04:58] (03PS1) 10BBlack: Revert "Remove borked cp4024 from ipsec nodelists" [puppet] - 10https://gerrit.wikimedia.org/r/391287 [20:05:17] (03CR) 10BBlack: [V: 032 C: 032] Revert "Remove borked cp4024 from ipsec nodelists" [puppet] - 10https://gerrit.wikimedia.org/r/391287 (owner: 10BBlack) [20:07:29] 10Operations, 10ops-ulsfo, 10Traffic: cp4024 kernel errors - https://phabricator.wikimedia.org/T174891#3760577 (10BBlack) For now I'm puppetizing it back into the cluster (and ipsec lists), but not repooling yet... 
[20:15:05] !log reboot cp4024 [20:15:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:20:56] 10Operations, 10Release Pipeline, 10Release-Engineering-Team (Kanban): Upgrade latest docker-registry.wikimedia.org/nodejs-devel to stretch - https://phabricator.wikimedia.org/T180524#3760612 (10dduvall) [20:25:12] (03PS1) 10Dzahn: planet: add Wikimedia Community Logo to rawdog style [puppet] - 10https://gerrit.wikimedia.org/r/391289 (https://phabricator.wikimedia.org/T180498) [20:27:01] (03PS2) 10Dzahn: planet: add Wikimedia Community Logo to rawdog style [puppet] - 10https://gerrit.wikimedia.org/r/391289 (https://phabricator.wikimedia.org/T180498) [20:28:09] (03CR) 10Dzahn: [C: 032] planet: add Wikimedia Community Logo to rawdog style [puppet] - 10https://gerrit.wikimedia.org/r/391289 (https://phabricator.wikimedia.org/T180498) (owner: 10Dzahn) [20:28:14] (03PS3) 10Dzahn: planet: add Wikimedia Community Logo to rawdog style [puppet] - 10https://gerrit.wikimedia.org/r/391289 (https://phabricator.wikimedia.org/T180498) [20:29:14] 10Operations, 10Wikimedia-Planet, 10Patch-For-Review: planet.wikimedia.org: replace planet-venus software with rawdog - https://phabricator.wikimedia.org/T180498#3760657 (10Dzahn) [20:33:10] (03CR) 10Paladox: planet: add Wikimedia Community Logo to rawdog style (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/391289 (https://phabricator.wikimedia.org/T180498) (owner: 10Dzahn) [20:34:50] PROBLEM - IPsec on cp4024 is CRITICAL: Strongswan CRITICAL - ok: 48 not-conn: cp1062_v4, cp1062_v6, cp1073_v4, cp1073_v6,kafka1020_v4,kafka1020_v6 [20:35:00] (03PS7) 10BBlack: normalize_path: fully normalize MW+RB URL paths [puppet] - 10https://gerrit.wikimedia.org/r/391216 (https://phabricator.wikimedia.org/T127387) [20:36:50] RECOVERY - IPsec on cp4024 is OK: Strongswan OK - 54 ESP OK [20:37:52] !log demon@tin rebuilt wikiversions.php and synchronized wikiversions files: group0 to wmf.8 [20:37:58] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:42:19] (03CR) 10Dzahn: planet: add Wikimedia Community Logo to rawdog style (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/391289 (https://phabricator.wikimedia.org/T180498) (owner: 10Dzahn) [20:42:30] paladox: don't care about IE 8 :) [20:42:46] like.. arent they excluded anyways, heh [20:44:42] Is phab down or is it just me? "833 packets transmitted, 0 received, 100% packet loss, time 843101ms" [20:44:55] eddiegp: phab works for me [20:45:31] eddiegp: Whereabooots are you? [20:45:38] Northen Germany [20:45:44] Are other wikimedia properties working? [20:45:47] Gerrit, wikipedia etc [20:46:25] mutante: FTR, even IE7 isn't excluded for direct connectivity (because IE[78]-for-Win7 exists, and has better crypto than IE[78]-on-XP that we killed support for). Also, technically someone can still connect to us with IE6, if they use some kind of TLS-intercepting proxy to upgrade their outbound crypto. [20:46:59] hopefully the crypto changes alter the client percentages we see for the better, but bottom line is they don't actually completely stop much of anything being *possible* [20:47:21] gerrit and wikitech work, ... [20:47:39] err... s/Win7/Vista/ above. I think Win7 started at IE9? [20:47:56] bblack: *nod*, thank you. this one was about using .svg for the logo, i could provide a .png .. hrm [20:48:18] eddiegp: try something else that is also behind "misc-web". for example https://annual.wikimedia.org [20:48:31] Hmm, wikipedia, wikivoyage, ... don't work either. [20:48:54] eddiegp: https://wikitech.wikimedia.org/wiki/Reporting_a_connectivity_issue [20:48:56] traceroute: https://pastebin.com/EVdY3NCi [20:49:12] eddiegp: try "mtr" [20:49:18] (instead of traceroute) [20:49:46] but fwiw, the traceroute dies in Versatel [20:49:57] Not so versa [20:52:51] bblack: mtr shows the same route [20:53:15] where does it show the loss? 
[20:53:38] also, I just did some probes towards your versatel IPs from esams, looks healthy at present? [20:53:49] Just came back [20:53:58] So works again, hmm... [20:54:15] it moved routes just as I was looking, too [20:54:30] from what looked like a direct peering, to transiting via datahop [20:54:53] mtr was -exactly- step for step the same as traceroute, failed after the fifth host 62.214.37.202 fwiw [20:55:46] maybe something went wrong with versatel's port at AMS-IX, or specifically their peering with us there, but either way it seems to have fixed itself via transits now [20:56:52] Yeah, seems so. Thanks anyways :) [21:01:20] !log ebernhardson@tin Started deploy [search/mjolnir/deploy@77310bc]: update mjolnir to master [21:01:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:03:25] !log ebernhardson@tin Finished deploy [search/mjolnir/deploy@77310bc]: update mjolnir to master (duration: 02m 05s) [21:03:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:07:33] (03CR) 10Dzahn: [C: 031] "back in this timezone, i can do this now if you like" [puppet] - 10https://gerrit.wikimedia.org/r/366910 (owner: 10Paladox) [21:09:08] 10Operations, 10Release Pipeline, 10Release-Engineering-Team (Watching / External): Update Debian package for Blubber - https://phabricator.wikimedia.org/T179984#3760776 (10thcipriani) [21:09:36] mutante: Was gonna step away for a smoke break, but I can do it with ya right after [21:11:24] bblack: fwiw, it's back down. But still failing within versatel, so I guess not much to do about it on wm's side. [21:12:14] Ring your local indian technical support out of hours? :P [21:13:26] Well, I had to phone them a few times before... I guess shutting down my computer and try again tomorrow will be more senseful. 
;) [21:13:53] I gave up with front line tech support at one ISP [21:14:01] XioNoX: ^ if you're around, something up with Versatel peering failing at AMS-IX (and flapping over to transit at least some of the time?) [21:14:02] because they didn't know who/what Telia were [21:14:16] hey [21:15:01] XioNoX: the original paste with some versatel IPs in it at: https://pastebin.com/EVdY3NCi [21:15:12] reading scrollback [21:15:27] Reedy: Good thing is, at the point they ask about my operating system and get "Linux" as a response I usually get escalated :D [21:15:33] !log ebernhardson@tin Started deploy [search/mjolnir/deploy@494e0c6]: redeploy mjolnir submodule bump to master [21:15:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:17:45] !log ebernhardson@tin Finished deploy [search/mjolnir/deploy@494e0c6]: redeploy mjolnir submodule bump to master (duration: 02m 11s) [21:17:49] mtr traceroute, as bblack asked for it earlier: https://pastebin.com/YRcaJe3n [21:17:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:18:01] we don't peer directly with versatel but we all peer with the route server [21:19:18] Sigh, I guess I'll call them, although I expect it to be worthless. [21:19:23] eddiegp: what's your IP? (pm is fine) [21:19:54] on that mtr the issue seems to start at the interco between Datahop and versatel [21:20:00] Reedy: find their technicans on twitter/linkedin? 
[21:20:05] XioNoX: 94.134.229.67 [21:20:33] (03PS3) 10Ottomata: [WIP] Add cergen module [puppet] - 10https://gerrit.wikimedia.org/r/391134 (https://phabricator.wikimedia.org/T166167) [21:20:59] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Add cergen module [puppet] - 10https://gerrit.wikimedia.org/r/391134 (https://phabricator.wikimedia.org/T166167) (owner: 10Ottomata) [21:21:01] (03PS4) 10Ottomata: [WIP] Add cergen module [puppet] - 10https://gerrit.wikimedia.org/r/391134 (https://phabricator.wikimedia.org/T166167) [21:21:28] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Add cergen module [puppet] - 10https://gerrit.wikimedia.org/r/391134 (https://phabricator.wikimedia.org/T166167) (owner: 10Ottomata) [21:21:34] (03PS5) 10Ottomata: [WIP] Add cergen module [puppet] - 10https://gerrit.wikimedia.org/r/391134 (https://phabricator.wikimedia.org/T166167) [21:21:57] no_justification: ok, let's get it done :) [21:22:00] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Add cergen module [puppet] - 10https://gerrit.wikimedia.org/r/391134 (https://phabricator.wikimedia.org/T166167) (owner: 10Ottomata) [21:22:22] versatel.de has address 82.140.32.137 [21:22:22] ayounsi@bast3002:~$ mtr 82.140.32.137 -z --report-wide [21:22:22] 1. AS14907 ae1-100.cr2-esams.wikimedia.org 0.0% 10 0.2 0.2 0.2 0.3 0.0 [21:22:22] 2. AS1200 xnl1002aihb001.versatel.de 0.0% 10 0.9 2.4 0.9 13.7 3.9 [21:22:23] 3. AS??? ??? 100.0 10 0.0 0.0 0.0 0.0 0.0 [21:22:25] great [21:23:37] (03CR) 10Ottomata: "Volans, _joe_, just wanna see what you think about this approach in general. Should I keep trying?" [puppet] - 10https://gerrit.wikimedia.org/r/391134 (https://phabricator.wikimedia.org/T166167) (owner: 10Ottomata) [21:24:12] ottomata: I can have a look tomorrow [21:24:26] danke [21:24:28] its nuts. 
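The `mtr -z --report-wide` output pasted at 21:22:22 can also be read mechanically. Below is a rough Python sketch that parses those three report lines and returns the first hop showing 100% loss. The regular expression and the function names are assumptions written for this paste only, not part of mtr. A hop at 100% loss may simply not answer probes, so this is a hint about where to look, not proof of where traffic dies.

```python
import re
from typing import List, Optional, Tuple

# The three hop lines as pasted in the channel.
REPORT = """\
1. AS14907 ae1-100.cr2-esams.wikimedia.org 0.0% 10 0.2 0.2 0.2 0.3 0.0
2. AS1200 xnl1002aihb001.versatel.de 0.0% 10 0.9 2.4 0.9 13.7 3.9
3. AS??? ??? 100.0 10 0.0 0.0 0.0 0.0 0.0
"""

# hop number, AS field, host, loss (the "%" is missing on the last line), sent
HOP_RE = re.compile(r"^\s*(\d+)\.\s+(AS\S+)\s+(\S+)\s+([\d.]+)%?\s+(\d+)")


def parse_hops(report: str) -> List[Tuple[int, str, str, float]]:
    """Return (hop, asn, host, loss_percent) for each parsable line."""
    hops = []
    for line in report.splitlines():
        match = HOP_RE.match(line)
        if match:
            hop, asn, host, loss, _sent = match.groups()
            hops.append((int(hop), asn, host, float(loss)))
    return hops


def first_silent_hop(report: str) -> Optional[int]:
    """Hop number of the first hop with 100% loss, or None."""
    for hop, _asn, _host, loss in parse_hops(report):
        if loss >= 100.0:
            return hop
    return None


if __name__ == "__main__":
    print(first_silent_hop(REPORT))
```

On this paste the last answering hop is xnl1002aihb001.versatel.de (AS1200), which fits the observation in the channel that the path fails inside Versatel.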
[21:24:29] haha [21:25:01] lol [21:27:58] eddiegp: I'll email their support, pm me your email if you want me to CC you [21:28:50] XioNoX: That's wikimedia.org at eddie-sh.de (it's on gerrit anyways) [21:31:59] (03PS1) 10Dzahn: planet: turn off day/time sections in rawdog style [puppet] - 10https://gerrit.wikimedia.org/r/391323 [21:32:15] (03CR) 10jerkins-bot: [V: 04-1] planet: turn off day/time sections in rawdog style [puppet] - 10https://gerrit.wikimedia.org/r/391323 (owner: 10Dzahn) [21:32:27] (03PS2) 10Dzahn: planet: turn off day/time sections in rawdog style [puppet] - 10https://gerrit.wikimedia.org/r/391323 [21:33:06] mutante: Sounds good, leggo [21:33:11] (03PS3) 10Dzahn: planet: turn off day/time sections in rawdog style [puppet] - 10https://gerrit.wikimedia.org/r/391323 (https://phabricator.wikimedia.org/T180498) [21:33:20] (03PS16) 10Dzahn: Gerrit: Remove ldap user and password from secure.config [puppet] - 10https://gerrit.wikimedia.org/r/366910 (owner: 10Paladox) [21:33:58] (03CR) 10Dzahn: [C: 032] planet: turn off day/time sections in rawdog style [puppet] - 10https://gerrit.wikimedia.org/r/391323 (https://phabricator.wikimedia.org/T180498) (owner: 10Dzahn) [21:34:08] (03CR) 10Dzahn: [C: 032] Gerrit: Remove ldap user and password from secure.config [puppet] - 10https://gerrit.wikimedia.org/r/366910 (owner: 10Paladox) [21:34:17] (03PS17) 10Dzahn: Gerrit: Remove ldap user and password from secure.config [puppet] - 10https://gerrit.wikimedia.org/r/366910 (owner: 10Paladox) [21:34:26] mutante: Temporarily disabled puppet on cobalt so we can manage it by hand [21:34:33] no_justification: ok, great! [21:35:16] no_justification: merged on master [21:35:46] eddiegp: is it working again? I can't reproduce the issue now... [21:36:14] XioNoX: Yeah, it's back. [21:36:29] no_justification: applied on gerrit2001 [21:36:33] The question is for how long ... ;) [21:37:13] eddiegp: has it been on and off? [21:37:34] Yes, see the backscroll. 
[21:37:54] mutante: Ok, gonna enable puppet on cobalt, run it, then restart service & pray [21:38:00] !log gerrit2001 - restarted gerrit [21:38:04] no_justification: ok :) [21:38:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:38:10] 21:53:57 eddiegp | So works again, hmm... , 22:11:23 eddiegp | fwiw, it's back down. [21:38:26] XioNoX: ^ [21:38:40] kk [21:41:02] Is gerrit down? [21:41:13] i'm getting 503's trying to clone... [21:41:18] same [21:41:20] fatal: unable to access 'https://gerrit.wikimedia.org/r/p/mediawiki/core/': The requested URL returned error: 503 [21:41:36] Seems back now [21:41:37] Oh, I see mutante just restarted... [21:41:55] yep, back now [21:42:16] Oh whoops, I typed my !log but forgot to press enter [21:42:24] * Reedy is in desperate need of more copies of mw core [21:42:28] the "pray" part worked in 16:37 < no_justification> mutante: Ok, gonna enable puppet on cobalt, run it, then restart service & pray [21:42:29] !log gerrit: restarted services on master/cobalt, things will flap for a second [21:42:31] :) [21:42:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:43:17] eddiegp: can you share another traceroute now that it's working fine? 
[21:43:20] (to compare) [21:43:32] sure [21:43:45] mutante: Login / logout / login worked, also obtained all my correct group memberships [21:43:54] (it was the latter I was most nervous about) [21:44:13] yay!:) nice that this one is done now [21:44:16] paladox: ^ [21:44:27] (03Draft1) 10Paladox: puppetmaster: Use ruby-mysql2 over ruby-mysql [puppet] - 10https://gerrit.wikimedia.org/r/391336 [21:44:28] (03Draft2) 10Paladox: puppetmaster: Use ruby-mysql2 over ruby-mysql [puppet] - 10https://gerrit.wikimedia.org/r/391336 [21:44:31] thanks [21:44:34] :) [21:45:19] (03PS3) 10Rush: passwords: update labs root key for Daniel [labs/private] - 10https://gerrit.wikimedia.org/r/390240 (owner: 10Dzahn) [21:45:22] XioNoX: https://pastebin.com/hL2gRwGk [21:45:59] no_justification: well, cool, this one took a while, we were both paranoid, but that's good :) [21:47:44] (03CR) 10Chad: "Wait, if this goes in the plugins/ directory, we should check it into the deploy repo, not puppet." [puppet] - 10https://gerrit.wikimedia.org/r/368547 (owner: 10Paladox) [21:48:25] eddiegp: sent [21:48:27] (03CR) 10Paladox: "> Wait, if this goes in the plugins/ directory, we should check it" [puppet] - 10https://gerrit.wikimedia.org/r/368547 (owner: 10Paladox) [21:48:31] (03CR) 10Rush: [C: 031] passwords: update labs root key for Daniel [labs/private] - 10https://gerrit.wikimedia.org/r/390240 (owner: 10Dzahn) [21:49:10] Thanks, got it :) [21:49:27] even if on our side we force another path, the return traffic will be through the problematic point [21:49:35] * Krenair faxes Reedy copies of mw core [21:49:36] as they don't seem to have any communities [21:50:36] (03CR) 10Dzahn: [V: 032 C: 032] "thanks" [labs/private] - 10https://gerrit.wikimedia.org/r/390240 (owner: 10Dzahn) [21:52:02] (03CR) 10Dzahn: "shouldn't this have some "if stretch" around it?" 
[puppet] - 10https://gerrit.wikimedia.org/r/391336 (owner: 10Paladox) [21:52:28] (03Abandoned) 10Chad: Revert "Add symlinks for Debian-packaged Bouncycastle Jars" [puppet] - 10https://gerrit.wikimedia.org/r/350446 (owner: 10Paladox) [21:52:53] copy 4 of MW incoming [21:53:13] (03CR) 10Paladox: "> shouldn't this have some "if stretch" around it?" [puppet] - 10https://gerrit.wikimedia.org/r/391336 (owner: 10Paladox) [21:53:16] (03CR) 10Ayounsi: [WIP] Have every rdns advertise a private anycast VIP (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/391149 (owner: 10Ayounsi) [21:53:27] (03CR) 10Paladox: "https://packages.debian.org/stretch/ruby-mysql2" [puppet] - 10https://gerrit.wikimedia.org/r/391336 (owner: 10Paladox) [21:54:18] (03PS4) 10Ayounsi: [WIP] Have every rdns advertise a private anycast VIP [puppet] - 10https://gerrit.wikimedia.org/r/391149 [21:54:32] (03CR) 10Dzahn: [C: 031] "confirmed:) " You may want to prefer the ruby-mysql2 package over the ruby-mysql package, since benchmarks have shown it to be faster, it " [puppet] - 10https://gerrit.wikimedia.org/r/391336 (owner: 10Paladox) [21:59:39] mutante: I'd love to knock out the logstash one next [22:00:48] (03CR) 10Dzahn: [C: 031] Include php5 packages on canary hosts [puppet] - 10https://gerrit.wikimedia.org/r/391045 (owner: 10Hoo man) [22:01:19] no_justification: yea, that would be nice. we assume all we needed was a logstash restart? [22:02:42] (03PS7) 10Rush: apt: unattended upgrades for wikimedia packages by default [puppet] - 10https://gerrit.wikimedia.org/r/389480 (https://phabricator.wikimedia.org/T177920) (owner: 10Arturo Borrero Gonzalez) [22:02:47] That and firewall fixes [22:02:49] Im having a hard time testing the logstash one. [22:03:15] though i think it may be because logstash + mw in the same vagrant install is using alot of cpu. 
[22:03:41] 10Operations, 10monitoring, 10Graphite, 10Performance-Team (Radar): Upgrade to latest Grafana 4.6 - https://phabricator.wikimedia.org/T180428#3760907 (10Krinkle) p:05Triage>03Normal [22:04:35] (03CR) 10Dzahn: [C: 031] Add tgr to deploy-service group [puppet] - 10https://gerrit.wikimedia.org/r/391026 (https://phabricator.wikimedia.org/T180366) (owner: 10Mholloway) [22:05:08] (03PS3) 10Krinkle: [WIP] Split profile.php from StartProfiler, and create PhpAutoPrepend.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391162 (https://phabricator.wikimedia.org/T180183) [22:05:14] RECOVERY - haproxy failover on dbproxy1010 is OK: OK check_failover servers up 0 down 0 [22:05:46] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Split profile.php from StartProfiler, and create PhpAutoPrepend.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391162 (https://phabricator.wikimedia.org/T180183) (owner: 10Krinkle) [22:06:32] (03PS4) 10Krinkle: [WIP] Split profile.php from StartProfiler, and create PhpAutoPrepend.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391162 (https://phabricator.wikimedia.org/T180183) [22:06:45] (03CR) 10Dzahn: [C: 032] webperf: Record navtiming discards to Graphite, and add is_sane test [puppet] - 10https://gerrit.wikimedia.org/r/390061 (owner: 10Krinkle) [22:07:12] (03CR) 10Dzahn: [C: 032] webperf: Refactor tests to directly associate expected data with cases [puppet] - 10https://gerrit.wikimedia.org/r/390083 (owner: 10Krinkle) [22:07:18] (03PS4) 10Dzahn: webperf: Refactor tests to directly associate expected data with cases [puppet] - 10https://gerrit.wikimedia.org/r/390083 (owner: 10Krinkle) [22:07:22] mutante: Thanks :) [22:07:46] Krinkle: Do you think I could safely roll out the docroot/foundation removal now? 
[22:07:50] yw, it had multiple +1 [22:07:55] no_justification: Yes [22:08:14] PROBLEM - haproxy failover on dbproxy1010 is CRITICAL: CRITICAL check_failover servers up 1 down 1 [22:08:31] (03PS5) 10Dzahn: webperf: Record navtiming discards to Graphite, and add is_sane test [puppet] - 10https://gerrit.wikimedia.org/r/390061 (owner: 10Krinkle) [22:08:51] no_justification: Assuming apache has been restarted once across the fleet since? [22:09:52] (03PS5) 10Ayounsi: [WIP] Have every rdns advertise a private anycast VIP [puppet] - 10https://gerrit.wikimedia.org/r/391149 [22:10:18] Krinkle: applied on hafnium [22:12:42] Krinkle: Safe to assume? [22:12:43] Heh [22:12:55] I don't know how often we restart apaches regardless [22:12:58] I don't think very often [22:13:12] (03PS2) 10Dzahn: snapshot: Use --no-cache for dumping Wikidata entities [puppet] - 10https://gerrit.wikimedia.org/r/391053 (https://phabricator.wikimedia.org/T180048) (owner: 10Hoo man) [22:13:49] (03PS3) 10Dzahn: snapshot: Use --no-cache for dumping Wikidata entities [puppet] - 10https://gerrit.wikimedia.org/r/391053 (https://phabricator.wikimedia.org/T180048) (owner: 10Hoo man) [22:15:52] (03CR) 10Dzahn: [C: 032] snapshot: Use --no-cache for dumping Wikidata entities [puppet] - 10https://gerrit.wikimedia.org/r/391053 (https://phabricator.wikimedia.org/T180048) (owner: 10Hoo man) [22:17:48] (03PS3) 10Dzahn: contint: migrate publisher to a profile [puppet] - 10https://gerrit.wikimedia.org/r/386813 (owner: 10Hashar) [22:18:03] (03CR) 10Dzahn: [C: 032] "per "cherrypicked / labs-only / noop" :)" [puppet] - 10https://gerrit.wikimedia.org/r/386813 (owner: 10Hashar) [22:18:23] (03CR) 10Chad: [WIP] Split profile.php from StartProfiler, and create PhpAutoPrepend.php (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391162 (https://phabricator.wikimedia.org/T180183) (owner: 10Krinkle) [22:19:52] Krinkle: I suppose safest thing to do would be to have an opsen do a rolling restart for us first :) 
[22:20:16] * Reedy clones his 5th copy of MW core [22:21:10] no_justification: Yeah [22:29:18] (03PS1) 10Chad: Don't bother symlinking mobilelanding.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391349 [22:29:49] (03PS3) 10Paladox: Gerrit: remove libbcprov-java and libbcpkix-java packages [puppet] - 10https://gerrit.wikimedia.org/r/385105 [22:30:50] MW core number 6 [22:30:53] Krinkle: 391349 looks pretty safe to me. https://github.com/search?q=org%3Awikimedia+mobilelanding&type=Code - 2 are false positives, 1 is a comment, two remaining ones are ok [22:34:35] (03Draft1) 10Paladox: contint: Remove ruby2.1 as it is in base package [puppet] - 10https://gerrit.wikimedia.org/r/391351 [22:34:37] (03Draft2) 10Paladox: contint: Remove ruby2.1 as it is in base package [puppet] - 10https://gerrit.wikimedia.org/r/391351 [22:36:00] (03CR) 10Krinkle: [C: 031] Don't bother symlinking mobilelanding.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391349 (owner: 10Chad) [22:36:21] no_justification: Yep, good to go afaics. Double-check on mwdebug? [22:37:05] (03CR) 10Chad: [C: 032] Don't bother symlinking mobilelanding.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391349 (owner: 10Chad) [22:37:19] (03CR) 10jenkins-bot: Don't bother symlinking mobilelanding.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391349 (owner: 10Chad) [22:37:21] I don't think I have the plugin on my new laptop yet :P [22:38:48] (03CR) 10Dzahn: [C: 031] "i think we can ignore this particular -1 from jenkins-bot, better to convert the whole role to profile at once later" [puppet] - 10https://gerrit.wikimedia.org/r/391011 (https://phabricator.wikimedia.org/T179565) (owner: 10Filippo Giunchedi) [22:39:58] (03PS3) 10Paladox: contint: Fix support for stretch in ruby.pp [puppet] - 10https://gerrit.wikimedia.org/r/391351 [22:42:27] Krinkle: I guess....it works? [22:42:51] (03CR) 10Dzahn: "how does it fix support for stretch though if it simply skips the entire section?" 
[puppet] - 10https://gerrit.wikimedia.org/r/391351 (owner: 10Paladox) [22:42:51] no_justification: Is it on mwdebug1002? [22:42:55] Yeah. [22:43:15] Hard to test. For a non-Zero-rated domain on a non-mobile browser, I just end up at www.wikipedia.org [22:43:21] (after browsing to m.wikipedia.org) [22:43:26] I guess that's correct? [22:44:15] no_justification: getting 404 not found on mwdebug1002 [22:44:33] I would expect that from non-m cases like https://en.wiktionary.org/w/mobilelanding.php [22:44:39] but even for https://en.m.wikipedia.org/w/mobilelanding.php? and https://en.zero.wikipedia.org/w/mobilelanding.php? [22:44:40] I get 404 [22:45:29] Reverted on tin, pulling to mwdebug1002 [22:45:41] Oh [22:45:42] righ [22:45:45] not on sub domains [22:45:50] only m and zero itself [22:46:16] didn't test there [22:46:59] Let's try again, and this time I'll be sure to also try the rewrite rule (e.g. / not /w/mobilelanding.php) [22:49:14] (03PS4) 10Paladox: contint: Fix support for stretch in ruby.pp [puppet] - 10https://gerrit.wikimedia.org/r/391351 [22:49:45] Krinkle: Re pulled to mwdebug1002 [22:52:59] no_justification: Works the same (positive). 
LGTM [22:54:01] (03CR) 10Dzahn: "would seem nicer if you leave the default unchanged and check for it actually being stretch - instead of the other way around" [puppet] - 10https://gerrit.wikimedia.org/r/391351 (owner: 10Paladox) [22:54:28] (03PS5) 10Paladox: contint: Fix support for stretch in ruby.pp [puppet] - 10https://gerrit.wikimedia.org/r/391351 [22:55:19] !log demon@tin Synchronized docroot/m.wikipedia.org/w/mobilelanding.php: symlink -> real file (duration: 00m 50s) [22:55:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:57:47] !log demon@tin Synchronized w/: Removing mobilelanding from global w/, few sites actually need it (duration: 00m 49s) [22:57:49] (03PS6) 10Paladox: contint: Fix support for stretch in ruby.pp [puppet] - 10https://gerrit.wikimedia.org/r/391351 [22:57:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:58:12] (03CR) 10jerkins-bot: [V: 04-1] contint: Fix support for stretch in ruby.pp [puppet] - 10https://gerrit.wikimedia.org/r/391351 (owner: 10Paladox) [22:58:32] no_justification: You know... I don't think this is actually working or being reached [22:58:36] (03CR) 10Dzahn: contint: Fix support for stretch in ruby.pp (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/391351 (owner: 10Paladox) [22:58:39] aside from direct /w/mobilelanding.php previously [22:58:56] The code in that file and in ZeroBanner never points to www [22:59:11] The reason we're getting the www redirect is because varnish rewrites m. zero. to canonical [22:59:26] which for m. and zero. is .org, which redirects.conf redirects to www. [22:59:29] Great stuff :) [22:59:30] 22:58:10 modules/contint/manifests/packages/ruby.pp:41 WARNING top-scope variable being used without an explicit namespace (variable_scope) [22:59:33] uh woops [22:59:34] This may be a problem. [22:59:39] But nothing caused by today's events. 
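The redirect chain Krinkle describes above (varnish rewrites the m./zero. host to the canonical one, and the bare .org domain is then redirected to www.) can be sketched as follows. This is a simplified Python model under stated assumptions, not the actual VCL or redirects.conf: the label-stripping rule, the "two labels means bare project domain" test, and all function names are hypothetical.

```python
from typing import Optional, Tuple

# Assumption: the mobile/zero marker is a single host label.
MOBILE_LABELS = {"m", "zero"}


def canonical_host(host: str) -> str:
    """Model of the edge rewrite: drop any 'm' or 'zero' label."""
    labels = host.split(".")
    return ".".join(label for label in labels if label not in MOBILE_LABELS)


def redirect_target(host: str) -> Optional[str]:
    """Model of the backend redirect: bare 'project.org' goes to 'www.'."""
    if host.count(".") == 1:
        return "www." + host
    return None


def resolve(host: str) -> Tuple[str, Optional[str]]:
    """Return (host the backend sees, redirect destination or None)."""
    backend_host = canonical_host(host)
    return backend_host, redirect_target(backend_host)


if __name__ == "__main__":
    for name in ("m.wikipedia.org", "zero.wikipedia.org", "en.m.wikipedia.org"):
        print(name, resolve(name))
```

In this model the backend never sees a Host of m.wikipedia.org or zero.wikipedia.org, which would explain why a non-mobile browser ends up at www.wikipedia.org regardless of what mobilelanding.php does.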
[22:59:55] https://github.com/wikimedia/puppet/blob/4124a2e2da85b12633040e6901c1e1ad955c8fd2/modules/varnish/templates/text-frontend.inc.vcl.erb#L99-L106 [23:00:01] (03PS7) 10Paladox: contint: Fix support for stretch in ruby.pp [puppet] - 10https://gerrit.wikimedia.org/r/391351 [23:00:22] I'm not 100% sure, but I suspect this is the code that makes it so that, in reality, apache is never reached with Host m.wikipedia.org or zero [23:01:33] !log Decommissioning Cassandra, restbase1014-a.eqiad.wmnet (T179422) [23:01:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:01:38] T179422: Reshape RESTBase Cassandra clusters - https://phabricator.wikimedia.org/T179422 [23:02:09] (03PS1) 10Chad: extract2.php: Stop allowing the www portal templates [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391353 [23:02:18] (03CR) 10Dzahn: [C: 031] contint: Fix support for stretch in ruby.pp [puppet] - 10https://gerrit.wikimedia.org/r/391351 (owner: 10Paladox) [23:02:45] (03CR) 10Hoo man: "@Elukey: Do you deem this worth working on or shall we ditch it?" [puppet] - 10https://gerrit.wikimedia.org/r/386662 (owner: 10Hoo man) [23:03:34] (03PS2) 10Chad: extract2.php: Stop allowing the www portal templates [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391353 [23:05:25] Krinkle: That would explain things. [23:07:02] 10Operations, 10Discovery, 10Wikimedia-Apache-configuration, 10Mobile: m.wikipedia.org incorrectly redirects to en.m.wikipedia.org - https://phabricator.wikimedia.org/T69015#3761037 (10Krinkle) Note that as of writing (November 2017) the "m.wikipedia.org" and "zero.wikipedia.org" landing domains **do** act... 
[23:07:09] 10Operations, 10DNS, 10Traffic, 10Wikimedia-Apache-configuration: m.{project}.org portal/redirect consistency - https://phabricator.wikimedia.org/T78421#3761041 (10Krinkle) [23:07:12] 10Operations, 10Discovery, 10Wikimedia-Apache-configuration, 10Mobile: m.wikipedia.org incorrectly redirects to en.m.wikipedia.org - https://phabricator.wikimedia.org/T69015#3761039 (10Krinkle) 05stalled>03Open p:05Normal>03High [23:07:30] 10Operations, 10Discovery, 10Wikimedia-Apache-configuration, 10Mobile: m.wikipedia.org and zero.wikipedia.org should redirect how/where - https://phabricator.wikimedia.org/T69015#730151 (10Krinkle) [23:07:34] PROBLEM - puppet last run on lvs4006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [23:07:56] (03PS1) 10Chad: Remove www.*.org symlinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/391355 [23:10:16] 10Operations, 10Discovery, 10Reading-Infrastructure-Team-Backlog, 10Wikimedia-Apache-configuration, and 2 others: m.wikipedia.org and zero.wikipedia.org should redirect how/where - https://phabricator.wikimedia.org/T69015#3761048 (10Krinkle) [23:18:12] (03CR) 10Dzahn: [C: 032] contint: Fix support for stretch in ruby.pp [puppet] - 10https://gerrit.wikimedia.org/r/391351 (owner: 10Paladox) [23:21:42] no_justification: I've reached out to reading-wmf@ about the issue. 
[23:26:15] (03PS1) 10Ayounsi: DNS: Only send eqsin countries to ulsfo [dns] - 10https://gerrit.wikimedia.org/r/391357 [23:27:40] (03PS2) 10Ayounsi: DNS: Only send eqsin countries to ulsfo [dns] - 10https://gerrit.wikimedia.org/r/391357 [23:28:10] Krinkle: Subscribed to the task [23:29:17] !log releases1001 - chmod g+w /srv/org/wikimedia/releases/mediawiki/1.* [23:29:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:32:43] 10Operations, 10Discovery, 10Wikimedia-Apache-configuration, 10Zero, and 2 others: m.wikipedia.org and zero.wikipedia.org should redirect how/where - https://phabricator.wikimedia.org/T69015#3761093 (10Mholloway) [23:37:35] RECOVERY - puppet last run on lvs4006 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [23:46:12] 10Operations, 10Release Pipeline, 10Release-Engineering-Team (Kanban): Upgrade latest docker-registry.wikimedia.org/nodejs-devel to stretch - https://phabricator.wikimedia.org/T180524#3760612 (10dduvall) [23:50:04] (03CR) 10BBlack: [C: 031] "Looks right. Let's hold deploy for ~Weds morning US time so we can be around to stare at stats, etc?" [dns] - 10https://gerrit.wikimedia.org/r/391357 (owner: 10Ayounsi) [23:52:44] RECOVERY - cassandra-b CQL 10.192.16.166:9042 on restbase2002 is OK: TCP OK - 0.036 second response time on 10.192.16.166 port 9042