[00:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Evening SWAT (Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181205T0000).
[00:00:04] dmaza and dcausse: A patch you scheduled for Evening SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[00:00:16] o/
[00:00:49] I can swat
[00:01:01] dmaza: around?
[00:01:02] I'm here
[00:01:13] let's do this :)
[00:01:22] sure!
[00:01:57] (CR) DCausse: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/476882 (https://phabricator.wikimedia.org/T210452) (owner: Dmaza)
[00:07:08] (PS2) DCausse: Enable Block notice stats on itwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/476882 (https://phabricator.wikimedia.org/T210452) (owner: Dmaza)
[00:08:26] (CR) DCausse: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/476882 (https://phabricator.wikimedia.org/T210452) (owner: Dmaza)
[00:09:25] (Merged) jenkins-bot: Enable Block notice stats on itwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/476882 (https://phabricator.wikimedia.org/T210452) (owner: Dmaza)
[00:10:10] !log bootstrapping cassandra-b, restbase2016 -- T210843
[00:10:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:10:14] T210843: Reshape RESTBase Cassandra cluster for server refresh - https://phabricator.wikimedia.org/T210843
[00:10:21] RECOVERY - cassandra-b SSL 10.192.32.111:7001 on restbase2016 is OK: SSL OK - Certificate restbase2016-b valid until 2020-11-29 09:26:15 +0000 (expires in 725 days)
[00:10:25] RECOVERY - cassandra-b service on restbase2016 is OK: OK - cassandra-b is active
[00:10:57] (CR) jenkins-bot: Enable Block notice stats on itwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/476882 (https://phabricator.wikimedia.org/T210452) (owner: Dmaza)
[00:11:28] dmaza: it's live on mwdebug1002, can you test your patch there?
[00:11:43] thank you! and yes I can
[00:11:46] checking
[00:13:36] (PS1) Volans: puppet: add PuppetMaster class [software/spicerack] - https://gerrit.wikimedia.org/r/477707 (https://phabricator.wikimedia.org/T205884)
[00:15:24] everything looks good here
[00:15:40] great, deploying
[00:16:18] dcausse: thank you very much
[00:16:21] let me know when we are live
[00:16:51] !log dcausse@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Enable Block notice stats on itwiki (T210452) (duration: 00m 47s)
[00:16:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:16:54] T210452: Enable block notice stats on itwiki - https://phabricator.wikimedia.org/T210452
[00:17:04] dmaza: it's live now
[00:17:26] dcausse: awesome.. have a good evening
[00:17:34] *night
[00:17:36] dmaza: thanks! :)
[00:17:59] (CR) DCausse: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/475749 (https://phabricator.wikimedia.org/T210381) (owner: DCausse)
[00:18:13] (CR) jerkins-bot: [V: -1] [cirrus] prepare multi-instance services [mediawiki-config] - https://gerrit.wikimedia.org/r/475749 (https://phabricator.wikimedia.org/T210381) (owner: DCausse)
[00:18:24] meh
[00:19:59] (PS10) DCausse: [cirrus] prepare multi-instance services [mediawiki-config] - https://gerrit.wikimedia.org/r/475749 (https://phabricator.wikimedia.org/T210381)
[00:20:01] (PS19) DCausse: [cirrus] Add temp clusters but still write to the old ones [mediawiki-config] - https://gerrit.wikimedia.org/r/475750 (https://phabricator.wikimedia.org/T210381)
[00:20:03] (PS9) DCausse: [cirrus] Start writing to psi & omega [mediawiki-config] - https://gerrit.wikimedia.org/r/476271 (https://phabricator.wikimedia.org/T210381)
[00:20:05] (PS9) DCausse: [cirrus] Start using replica group settings [mediawiki-config] - https://gerrit.wikimedia.org/r/476272 (https://phabricator.wikimedia.org/T210381)
[00:20:07] (PS11) DCausse: [cirrus] Cleanup transitional states [mediawiki-config] - https://gerrit.wikimedia.org/r/476273 (https://phabricator.wikimedia.org/T210381)
[00:21:24] (CR) DCausse: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/475749 (https://phabricator.wikimedia.org/T210381) (owner: DCausse)
[00:22:02] Operations, cloud-services-team, monitoring, User-fgiunchedi: Port DirectorySize diamond collector to a Prometheus exporter - https://phabricator.wikimedia.org/T211094 (colewhite) a:colewhite
[00:22:28] (Merged) jenkins-bot: [cirrus] prepare multi-instance services [mediawiki-config] - https://gerrit.wikimedia.org/r/475749 (https://phabricator.wikimedia.org/T210381) (owner: DCausse)
[00:24:14] (CR) jenkins-bot: [cirrus] prepare multi-instance services [mediawiki-config] - https://gerrit.wikimedia.org/r/475749 (https://phabricator.wikimedia.org/T210381) (owner: DCausse)
[00:28:57] !log dcausse@deploy1001 Synchronized wmf-config/ProductionServices.php: [cirrus] prepare multi-instance services (T210381) (duration: 00m 46s)
[00:29:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:29:02] T210381: Update mw-config to use the psi&omega elastic clusters in codfw - https://phabricator.wikimedia.org/T210381
[00:30:10] !log dcausse@deploy1001 Synchronized wmf-config/InitialiseSettings.php: [cirrus] prepare multi-instance services (T210381) (duration: 00m 46s)
[00:30:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:32:10] I'm going to stop here and deploy my next patch tomorrow (too tired)
[00:32:22] !log Evening SWAT done
[00:32:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:32:46] Operations, Analytics, ChangeProp, Services (designing), Wikimedia-Incident: Separate dev Change-Prop from production Kafka cluster - https://phabricator.wikimedia.org/T199427 (Nuria) ping on this issue, has this work been planned?
[00:35:29] PROBLEM - puppet last run on restbase1010 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[00:44:05] Operations, Analytics, ChangeProp, Services (designing), Wikimedia-Incident: Separate dev Change-Prop from production Kafka cluster - https://phabricator.wikimedia.org/T199427 (Pchelolo) p:Normal>Low We're currently not using CP in the dev cluster since we have no new major RESTBase f...
[00:46:44] (PS1) EBernhardson: Allow jupyterhub notebooks /dev/shm access [puppet] - https://gerrit.wikimedia.org/r/477711 (https://phabricator.wikimedia.org/T211163)
[01:01:25] RECOVERY - puppet last run on restbase1010 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures
[01:06:51] PROBLEM - Restbase edge eqiad on text-lb.eqiad.wikimedia.org is CRITICAL: /api/rest_v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received
[01:08:01] RECOVERY - Restbase edge eqiad on text-lb.eqiad.wikimedia.org is OK: All endpoints are healthy
[01:11:59] Operations, ops-codfw, Patch-For-Review: rack/setup/install elastic203[7-9], elastic204[0-9], elastic205[0-4] - https://phabricator.wikimedia.org/T210450 (Papaul)
[01:46:04] Operations, ops-codfw: Time on new servers different form time on puppetmaster1001 - https://phabricator.wikimedia.org/T211170 (Papaul)
[01:46:16] Operations, ops-codfw: Time on new servers different from time on puppetmaster1001 - https://phabricator.wikimedia.org/T211170 (Papaul)
[02:10:42] PROBLEM - Logstash rate of ingestion percent change compared to yesterday on icinga1001 is CRITICAL: 136.1 ge 130 https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1panelId=2fullscreen
[02:24:14] RECOVERY - cassandra-b CQL 10.192.32.111:9042 on restbase2016 is OK: TCP OK - 0.036 second response time on 10.192.32.111 port 9042
[02:25:49] Operations, ops-codfw, Patch-For-Review: rack/setup/install elastic203[7-9], elastic204[0-9], elastic205[0-4] - https://phabricator.wikimedia.org/T210450 (Papaul)
[02:26:48] Operations, ops-codfw, Patch-For-Review: rack/setup/install elastic203[7-9], elastic204[0-9], elastic205[0-4] - https://phabricator.wikimedia.org/T210450 (Papaul) a:Papaul>Gehel @Gehel all yours
[02:41:00] PROBLEM - IPMI Sensor Status on elastic2051 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical]
[02:53:24] Operations, Citoid, Regression, VisualEditor (Current work): The new translation-server returns access date with the full time stamp; we should strip this - https://phabricator.wikimedia.org/T211127 (Jonesey95) Reported so far at the following venues (that I have seen): https://en.wikipedia.org/...
[03:06:49] !log kartik@deploy1001 Started deploy [cxserver/deploy@a3dd2ca]: Update cxserver to c4240e6 and enable Youdao MT (T208985, T210578)
[03:06:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:06:55] T210578: Update Youdao MT client to new API - https://phabricator.wikimedia.org/T210578
[03:06:56] T208985: CX2: Support mapping templates based on their parameter names - https://phabricator.wikimedia.org/T208985
[03:11:16] !log kartik@deploy1001 Finished deploy [cxserver/deploy@a3dd2ca]: Update cxserver to c4240e6 and enable Youdao MT (T208985, T210578) (duration: 04m 26s)
[03:11:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:23:39] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 662.45 seconds
[03:50:55] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Scrapes sample page) timed out before a response was received
[03:54:17] PROBLEM - Restbase edge esams on text-lb.esams.wikimedia.org is CRITICAL: /api/rest_v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received
[03:54:27] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy
[03:55:41] PROBLEM - restbase endpoints health on restbase1007 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received
[03:56:47] RECOVERY - restbase endpoints health on restbase1007 is OK: All endpoints are healthy
[03:57:03] PROBLEM - puppet last run on labvirt1008 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[03:57:51] RECOVERY - Restbase edge esams on text-lb.esams.wikimedia.org is OK: All endpoints are healthy
[03:58:31] RECOVERY - cassandra-c service on restbase2016 is OK: OK - cassandra-c is active
[03:58:43] RECOVERY - Check systemd state on restbase2016 is OK: OK - running: The system is fully operational
[03:58:55] Operations, ops-codfw, Core Platform Team, Services (doing), and 2 others: Reshape RESTBase Cassandra cluster for server refresh - https://phabricator.wikimedia.org/T210843 (Eevans)
[03:59:05] RECOVERY - cassandra-c SSL 10.192.32.175:7001 on restbase2016 is OK: SSL OK - Certificate restbase2016-c valid until 2020-11-29 09:26:16 +0000 (expires in 725 days)
[03:59:09] PROBLEM - Restbase edge eqiad on text-lb.eqiad.wikimedia.org is CRITICAL: /api/rest_v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received
[04:00:15] RECOVERY - Restbase edge eqiad on text-lb.eqiad.wikimedia.org is OK: All endpoints are healthy
[04:03:38] !log bootstrapping cassandra-c, restbase2016 -- T210843
[04:03:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:03:42] T210843: Reshape RESTBase Cassandra cluster for server refresh - https://phabricator.wikimedia.org/T210843
[04:14:42] (PS1) Andrew Bogott: Horizon: move 'incubator' project to eqiad1-r [puppet] - https://gerrit.wikimedia.org/r/477715 (https://phabricator.wikimedia.org/T204745)
[04:15:28] (CR) Andrew Bogott: [C: 2] Horizon: move 'incubator' project to eqiad1-r [puppet] - https://gerrit.wikimedia.org/r/477715 (https://phabricator.wikimedia.org/T204745) (owner: Andrew Bogott)
[04:23:03] RECOVERY - puppet last run on labvirt1008 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures
[05:29:47] (PS1) CRusnov: Add an old hardware report [software/netbox-reports] - https://gerrit.wikimedia.org/r/477716
[05:34:04] (PS2) CRusnov: Add an old hardware report [software/netbox-reports] - https://gerrit.wikimedia.org/r/477716 (https://phabricator.wikimedia.org/T205899)
[06:00:39] PROBLEM - High load average on labstore1007 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [24.0] https://grafana.wikimedia.org/dashboard/db/labs-monitoring
[06:03:19] RECOVERY - cassandra-c CQL 10.192.32.175:9042 on restbase2016 is OK: TCP OK - 0.036 second response time on 10.192.32.175 port 9042
[06:15:18] RECOVERY - High load average on labstore1007 is OK: OK: Less than 85.00% above the threshold [16.0] https://grafana.wikimedia.org/dashboard/db/labs-monitoring
[06:25:37] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 265.10 seconds
[06:28:37] PROBLEM - Check systemd state on netmon2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[06:29:03] PROBLEM - netbox HTTPS on netmon1002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 547 bytes in 0.008 second response time
[06:29:15] PROBLEM - Check systemd state on netmon1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[06:30:33] PROBLEM - Restbase edge ulsfo on text-lb.ulsfo.wikimedia.org is CRITICAL: /api/rest_v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received
[06:31:41] RECOVERY - Restbase edge ulsfo on text-lb.ulsfo.wikimedia.org is OK: All endpoints are healthy
[06:37:33] RECOVERY - netbox HTTPS on netmon1002 is OK: HTTP OK: HTTP/1.1 302 Found - 348 bytes in 0.541 second response time
[06:37:45] RECOVERY - Check systemd state on netmon1002 is OK: OK - running: The system is fully operational
[06:39:33] RECOVERY - Check systemd state on netmon2001 is OK: OK - running: The system is fully operational
[06:43:11] PROBLEM - High load average on labstore1007 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [24.0] https://grafana.wikimedia.org/dashboard/db/labs-monitoring
[06:50:59] (PS1) Marostegui: db-eqiad.php: Depool db1091 [mediawiki-config] - https://gerrit.wikimedia.org/r/477724 (https://phabricator.wikimedia.org/T86338)
[06:52:53] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1091 [mediawiki-config] - https://gerrit.wikimedia.org/r/477724 (https://phabricator.wikimedia.org/T86338) (owner: Marostegui)
[06:53:56] (Merged) jenkins-bot: db-eqiad.php: Depool db1091 [mediawiki-config] - https://gerrit.wikimedia.org/r/477724 (https://phabricator.wikimedia.org/T86338) (owner: Marostegui)
[06:55:15] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1091 T86338 T202167 (duration: 00m 51s)
[06:55:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:55:20] T86338: Dropping page.page_counter on wmf databases - https://phabricator.wikimedia.org/T86338
[06:55:21] T202167: Schema change for rc_this_oldid index - https://phabricator.wikimedia.org/T202167
[06:55:22] !log Deploy schema change on db1091 T86338 T202167
[06:55:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:00:56] (CR) jenkins-bot: db-eqiad.php: Depool db1091 [mediawiki-config] - https://gerrit.wikimedia.org/r/477724 (https://phabricator.wikimedia.org/T86338) (owner: Marostegui)
[07:02:29] RECOVERY - High load average on labstore1007 is OK: OK: Less than 85.00% above the threshold [16.0] https://grafana.wikimedia.org/dashboard/db/labs-monitoring
[07:03:25] Operations, CirrusSearch, Discovery-Search: Find an alternative to curl connection pooling available in HHVM - https://phabricator.wikimedia.org/T210717 (Joe) >>! In T210717#4786321, @EBernhardson wrote: > Throwing some ideas out there: > > * PHP requests are stateless, and trying to share something...
[07:24:41] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1091" [mediawiki-config] - https://gerrit.wikimedia.org/r/477726
[07:25:45] (CR) Marostegui: [C: 2] Revert "db-eqiad.php: Depool db1091" [mediawiki-config] - https://gerrit.wikimedia.org/r/477726 (owner: Marostegui)
[07:26:47] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1091" [mediawiki-config] - https://gerrit.wikimedia.org/r/477726 (owner: Marostegui)
[07:27:00] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1091" [mediawiki-config] - https://gerrit.wikimedia.org/r/477726 (owner: Marostegui)
[07:27:46] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1091 T86338 T202167 (duration: 00m 46s)
[07:27:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:27:51] T86338: Dropping page.page_counter on wmf databases - https://phabricator.wikimedia.org/T86338
[07:27:51] T202167: Schema change for rc_this_oldid index - https://phabricator.wikimedia.org/T202167
[07:59:36] !log bootstrap cassandra-a, restbase2017 - T210843
[07:59:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:59:40] T210843: Reshape RESTBase Cassandra cluster for server refresh - https://phabricator.wikimedia.org/T210843
[08:01:37] !log installing pdns-recursor security update in esams
[08:01:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:14:52] (PS1) Elukey: druid: remove absented crons [puppet] - https://gerrit.wikimedia.org/r/477728
[08:16:36] (CR) Elukey: [C: 2] druid: remove absented crons [puppet] - https://gerrit.wikimedia.org/r/477728 (owner: Elukey)
[08:18:28] (CR) Filippo Giunchedi: "I believe this setting isn't used because we've switched cassandra to prometheus metrics (all clusters but maps). Let's set this to a bogu" [puppet] - https://gerrit.wikimedia.org/r/477602 (https://phabricator.wikimedia.org/T209357) (owner: Dzahn)
[08:18:38] (CR) Filippo Giunchedi: "I believe this setting isn't used because we've switched cassandra to prometheus metrics (all clusters but maps). Let's set this to a bogu" [puppet] - https://gerrit.wikimedia.org/r/477604 (https://phabricator.wikimedia.org/T209357) (owner: Dzahn)
[08:43:22] * addshore is going to enable a log channel quickly
[08:43:29] marostegui: ping ^^ as it's a mediawiki-config patch :)
[08:43:32] and I see you're working in there
[08:43:43] (PS3) Addshore: Define a new 'Wikibase' log channel to use [mediawiki-config] - https://gerrit.wikimedia.org/r/474185 (https://phabricator.wikimedia.org/T207850)
[08:44:30] (CR) Addshore: [C: 2] Define a new 'Wikibase' log channel to use [mediawiki-config] - https://gerrit.wikimedia.org/r/474185 (https://phabricator.wikimedia.org/T207850) (owner: Addshore)
[08:45:32] (Merged) jenkins-bot: Define a new 'Wikibase' log channel to use [mediawiki-config] - https://gerrit.wikimedia.org/r/474185 (https://phabricator.wikimedia.org/T207850) (owner: Addshore)
[08:48:22] !log addshore@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T207850 Define a new Wikibase log channel to use (duration: 00m 47s)
[08:48:24] all done
[08:48:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:48:26] T207850: Consolidate the log groups used within Wikibase & Wikibase extensions. - https://phabricator.wikimedia.org/T207850
[08:54:16] (CR) jenkins-bot: Define a new 'Wikibase' log channel to use [mediawiki-config] - https://gerrit.wikimedia.org/r/474185 (https://phabricator.wikimedia.org/T207850) (owner: Addshore)
[08:59:34] Operations, Core Platform Team Backlog (Watching / External), HHVM, TechCom-RFC (TechCom-Approved), User-ArielGlenn: Correctly collect logs from php-fpm pools - https://phabricator.wikimedia.org/T211184 (Joe) p:Triage>Normal
[09:00:10] <_joe_> !log disabled puppet on mw1261, used for logging tests for T211184
[09:00:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:00:13] T211184: Correctly collect logs from php-fpm pools - https://phabricator.wikimedia.org/T211184
[09:02:35] Operations, Core Platform Team Backlog (Watching / External), HHVM, TechCom-RFC (TechCom-Approved), User-ArielGlenn: Correctly collect logs from php-fpm pools - https://phabricator.wikimedia.org/T211184 (Joe) a:Joe
[09:02:53] (CR) Muehlenhoff: "That patch looks fine, but let's doublecheck that the intended outcome is correct. Andrew, can you confirm that the memcached metrics for " [puppet] - https://gerrit.wikimedia.org/r/477620 (https://phabricator.wikimedia.org/T147326) (owner: Cwhite)
[09:03:14] (PS2) Alexandros Kosiaris: upgrade puppet stdlib from 4.22.0 to 4.24.0 [puppet] - https://gerrit.wikimedia.org/r/475260 (owner: Dzahn)
[09:03:25] (CR) Alexandros Kosiaris: "PCC says noop at https://puppet-compiler.wmflabs.org/compiler1002/13842/" [puppet] - https://gerrit.wikimedia.org/r/475260 (owner: Dzahn)
[09:04:26] (CR) jerkins-bot: [V: -1] upgrade puppet stdlib from 4.22.0 to 4.24.0 [puppet] - https://gerrit.wikimedia.org/r/475260 (owner: Dzahn)
[09:07:10] !log matomo read only + upgrade to matomo 3.7.0 on matomo1001 - T209808
[09:07:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:07:14] T209808: Upgrade Matomo to 3.6.1 or 3.7.0 - https://phabricator.wikimedia.org/T209808
[09:13:49] (PS1) Marostegui: db-eqiad.php: Depool db1080 [mediawiki-config] - https://gerrit.wikimedia.org/r/477731
[09:15:15] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1080 [mediawiki-config] - https://gerrit.wikimedia.org/r/477731 (owner: Marostegui)
[09:16:16] (Merged) jenkins-bot: db-eqiad.php: Depool db1080 [mediawiki-config] - https://gerrit.wikimedia.org/r/477731 (owner: Marostegui)
[09:17:18] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1080 for MySQL upgrade (duration: 00m 46s)
[09:17:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:21:12] (CR) jenkins-bot: db-eqiad.php: Depool db1080 [mediawiki-config] - https://gerrit.wikimedia.org/r/477731 (owner: Marostegui)
[09:22:45] !log Stop MySQL on db1080 for mysql and kernel upgrade
[09:22:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:24:39] Operations, ops-codfw, Patch-For-Review: rack/setup/install elastic203[7-9], elastic204[0-9], elastic205[0-4] - https://phabricator.wikimedia.org/T210450 (Gehel) @Papaul: thanks! We'll take it from here, and notify you as soon as the old servers are ready for decommission.
[09:26:29] (PS2) Gehel: elasticsearch: add new elastic2045-elastic2054 [puppet] - https://gerrit.wikimedia.org/r/477523 (https://phabricator.wikimedia.org/T210265) (owner: Mathew.onipe)
[09:32:03] !log setting up new elasticsearch servers on codfw - elastic2045-2054 - T210265
[09:32:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:32:07] T210265: Setup elasticsearch on new codfw servers - https://phabricator.wikimedia.org/T210265
[09:32:19] (CR) Gehel: [C: 2] elasticsearch: add new elastic2045-elastic2054 [puppet] - https://gerrit.wikimedia.org/r/477523 (https://phabricator.wikimedia.org/T210265) (owner: Mathew.onipe)
[09:34:50] (CR) Urbanecm: [C: 1] Add shnwiki to InterwikiSortOrders.php [mediawiki-config] - https://gerrit.wikimedia.org/r/473510 (https://phabricator.wikimedia.org/T206777) (owner: Reedy)
[09:38:52] (PS1) Marostegui: db-eqiad.php: Increase weight for db1080 [mediawiki-config] - https://gerrit.wikimedia.org/r/477737
[09:40:44] (CR) Marostegui: [C: 2] db-eqiad.php: Increase weight for db1080 [mediawiki-config] - https://gerrit.wikimedia.org/r/477737 (owner: Marostegui)
[09:41:46] (Merged) jenkins-bot: db-eqiad.php: Increase weight for db1080 [mediawiki-config] - https://gerrit.wikimedia.org/r/477737 (owner: Marostegui)
[09:42:55] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Slowly pool db1080 (duration: 00m 46s)
[09:42:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:47:47] (CR) jenkins-bot: db-eqiad.php: Increase weight for db1080 [mediawiki-config] - https://gerrit.wikimedia.org/r/477737 (owner: Marostegui)
[09:50:56] (PS4) Elukey: service::node: add the 'use_nodejs10' parameter [puppet] - https://gerrit.wikimedia.org/r/477475 (https://phabricator.wikimedia.org/T210704)
[09:51:49] (PS1) Marostegui: db-eqiad.php: Increase weight for db1080 [mediawiki-config] - https://gerrit.wikimedia.org/r/477740
[09:55:47] PROBLEM - puppet last run on elastic2053 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:56:26] (PS1) Urbanecm: Define 2 new namespaces for yuewiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/477741 (https://phabricator.wikimedia.org/T205546)
[09:58:09] PROBLEM - puppet last run on elastic2054 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[10:00:28] (CR) Marostegui: [C: 2] db-eqiad.php: Increase weight for db1080 [mediawiki-config] - https://gerrit.wikimedia.org/r/477740 (owner: Marostegui)
[10:01:31] (Merged) jenkins-bot: db-eqiad.php: Increase weight for db1080 [mediawiki-config] - https://gerrit.wikimedia.org/r/477740 (owner: Marostegui)
[10:01:45] (CR) jenkins-bot: db-eqiad.php: Increase weight for db1080 [mediawiki-config] - https://gerrit.wikimedia.org/r/477740 (owner: Marostegui)
[10:02:46] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Increase weight for db1080 (duration: 00m 46s)
[10:02:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:04:50] Request from 2001:14ba:2bf5:3600:xxxx via cp1081 cp1081, Varnish XID 269739509; Error: 503, Backend fetch failed
[10:04:54] any way to debug this?
[10:07:09] (PS1) Marostegui: db-eqiad.php: Fully repool db1080 [mediawiki-config] - https://gerrit.wikimedia.org/r/477745
[10:07:54] (PS1) Mathew.onipe: elasticsearch: add rack info for elastic2053-2054 [puppet] - https://gerrit.wikimedia.org/r/477746
[10:08:16] Nikerabbit: which wiki?
[10:09:55] (CR) Gehel: [C: 2] elasticsearch: add rack info for elastic2053-2054 [puppet] - https://gerrit.wikimedia.org/r/477746 (owner: Mathew.onipe)
[10:10:09] !log depooling labsdb1010 for testing materialized views - T210693
[10:10:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:10:13] T210693: Create materialized views on Wiki Replica hosts for better query performance - https://phabricator.wikimedia.org/T210693
[10:10:54] (CR) Banyek: [C: 2] wiki replicas: depool labsdb1010 for testing materialized view [puppet] - https://gerrit.wikimedia.org/r/477624 (https://phabricator.wikimedia.org/T210693) (owner: Bstorm)
[10:11:04] (PS2) Banyek: wiki replicas: depool labsdb1010 for testing materialized view [puppet] - https://gerrit.wikimedia.org/r/477624 (https://phabricator.wikimedia.org/T210693) (owner: Bstorm)
[10:13:07] PROBLEM - cassandra-b service on restbase2018 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed
[10:13:11] PROBLEM - cassandra-a SSL 10.192.48.124:7001 on restbase2018 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[10:13:15] PROBLEM - cassandra-c SSL 10.192.48.126:7001 on restbase2018 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[10:13:17] PROBLEM - cassandra-b CQL 10.192.48.122:9042 on restbase2017 is CRITICAL: connect to address 10.192.48.122 and port 9042: Connection refused
[10:13:17] PROBLEM - Check systemd state on restbase2017 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[10:13:23] PROBLEM - cassandra-a CQL 10.192.48.121:9042 on restbase2017 is CRITICAL: connect to address 10.192.48.121 and port 9042: Connection refused
[10:13:29] PROBLEM - cassandra-c CQL 10.192.48.126:9042 on restbase2018 is CRITICAL: connect to address 10.192.48.126 and port 9042: Connection refused
[10:13:35] PROBLEM - cassandra-b service on restbase2017 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed
[10:13:37] PROBLEM - cassandra-a service on restbase2018 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed
[10:13:49] PROBLEM - cassandra-b SSL 10.192.48.122:7001 on restbase2017 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[10:13:51] PROBLEM - cassandra-c CQL 10.192.48.123:9042 on restbase2017 is CRITICAL: connect to address 10.192.48.123 and port 9042: Connection refused
[10:13:51] PROBLEM - cassandra-c SSL 10.192.48.123:7001 on restbase2017 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[10:13:53] PROBLEM - cassandra-a CQL 10.192.48.124:9042 on restbase2018 is CRITICAL: connect to address 10.192.48.124 and port 9042: Connection refused
[10:13:53] PROBLEM - cassandra-c service on restbase2017 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed
[10:13:53] elukey: ^ expected?
[10:13:59] PROBLEM - cassandra-c service on restbase2018 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed
[10:14:01] PROBLEM - cassandra-b CQL 10.192.48.125:9042 on restbase2018 is CRITICAL: connect to address 10.192.48.125 and port 9042: Connection refused
[10:14:15] PROBLEM - cassandra-b SSL 10.192.48.125:7001 on restbase2018 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[10:14:17] PROBLEM - Check systemd state on restbase2018 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[10:15:56] godog: I see in SAL you're doing some restbase work, so probably expected?
[10:17:52] !log T205969 icinga downtime the load avg check in labstore1007 for 1 week
[10:17:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:17:57] T205969: labstore1007: high load avg issue - https://phabricator.wikimedia.org/T205969
[10:18:16] gehel: sorry just seen the ping, no idea
[10:18:31] but those might be the new nodes
[10:18:47] RECOVERY - puppet last run on elastic2054 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[10:18:59] https://phabricator.wikimedia.org/T209615
[10:19:20] yeah Filippo is bootstrapping those probably
[10:19:24] (PS1) MarcoAurelio: Increase autoconfirmed count for Meta-Wiki to 5 [mediawiki-config] - https://gerrit.wikimedia.org/r/477747 (https://phabricator.wikimedia.org/T211188)
[10:19:48] elukey: yeah, that's what I saw in SAL, looks like new nodes
[10:20:01] super
[10:20:17] RECOVERY - puppet last run on elastic2053 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[10:20:31] PROBLEM - Check whether ferm is active by checking the default input chain on elastic2048 is CRITICAL: NRPE: Command check_ferm_active not defined
[10:20:32] PROBLEM - Check size of conntrack table on elastic2050 is CRITICAL: NRPE: Command check_conntrack_table_size not defined
[10:21:14] Operations, Traffic, Privacy: Disable WMF-Last-Access cookies for wmfusercontent.org - https://phabricator.wikimedia.org/T210167 (ema) @Nuria thoughts from #analytics on this?
[10:21:37] !log gehel@puppetmaster1001 conftool action : set/pooled=yes; selector: dc=codfw,cluster=elasticsearch [10:21:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:22:21] PROBLEM - Check size of conntrack table on elastic2052 is CRITICAL: NRPE: Command check_conntrack_table_size not defined [10:22:21] PROBLEM - Check whether ferm is active by checking the default input chain on elastic2050 is CRITICAL: NRPE: Command check_ferm_active not defined [10:23:22] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Fully repool db1080 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477745 (owner: 10Marostegui) [10:23:55] (03PS1) 10Muehlenhoff: Remove Diamond on maps servers [puppet] - 10https://gerrit.wikimedia.org/r/477748 (https://phabricator.wikimedia.org/T183454) [10:24:09] PROBLEM - Check size of conntrack table on elastic2045 is CRITICAL: NRPE: Command check_conntrack_table_size not defined [10:24:09] PROBLEM - Check size of conntrack table on elastic2047 is CRITICAL: NRPE: Command check_conntrack_table_size not defined [10:24:23] (03Merged) 10jenkins-bot: db-eqiad.php: Fully repool db1080 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477745 (owner: 10Marostegui) [10:25:28] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Fully repool db1080 (duration: 00m 46s) [10:25:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:25:29] I removed labsdb1010 from the proxy now I am waiting the connections to drain [10:26:01] PROBLEM - Check whether ferm is active by checking the default input chain on elastic2052 is CRITICAL: NRPE: Command check_ferm_active not defined [10:26:19] 10Operations, 10Core Platform Team Backlog (Next), 10Patch-For-Review, 10Services (next): Migrate node-based services in production to node10 - https://phabricator.wikimedia.org/T210704 (10MoritzMuehlenhoff) [10:26:58] (03CR) 10Ema: [C: 032] cache_upload: hfp on frontends for large objects except for exp 
[puppet] - https://gerrit.wikimedia.org/r/477573 (https://phabricator.wikimedia.org/T144187) (owner: Ema) [10:27:05] (PS2) Ema: cache_upload: hfp on frontends for large objects except for exp [puppet] - https://gerrit.wikimedia.org/r/477573 (https://phabricator.wikimedia.org/T144187) [10:27:11] (CR) jenkins-bot: db-eqiad.php: Fully repool db1080 [mediawiki-config] - https://gerrit.wikimedia.org/r/477745 (owner: Marostegui) [10:27:52] PROBLEM - Check whether ferm is active by checking the default input chain on elastic2045 is CRITICAL: NRPE: Command check_ferm_active not defined [10:27:52] PROBLEM - Check whether ferm is active by checking the default input chain on elastic2047 is CRITICAL: NRPE: Command check_ferm_active not defined [10:28:47] (PS3) Ema: cache: stop using nhw admission policy [puppet] - https://gerrit.wikimedia.org/r/477574 (https://phabricator.wikimedia.org/T144187) [10:28:49] gehel: indeed, I was a bit optimistic with the downtime, thanks for the heads up! [10:29:10] godog: no problem, as long as you're aware of what's going on! 
[10:29:35] (CR) Urbanecm: [C: 1] "LGTM" [mediawiki-config] - https://gerrit.wikimedia.org/r/477747 (https://phabricator.wikimedia.org/T211188) (owner: MarcoAurelio) [10:29:51] PROBLEM - Check size of conntrack table on elastic2049 is CRITICAL: NRPE: Command check_conntrack_table_size not defined [10:29:51] PROBLEM - Check size of conntrack table on elastic2051 is CRITICAL: NRPE: Command check_conntrack_table_size not defined [10:30:08] (CR) Ema: [C: 2] cache: stop using nhw admission policy [puppet] - https://gerrit.wikimedia.org/r/477574 (https://phabricator.wikimedia.org/T144187) (owner: Ema) [10:31:41] PROBLEM - Check size of conntrack table on elastic2053 is CRITICAL: NRPE: Command check_conntrack_table_size not defined [10:32:35] those ferm warnings above on elastic servers seem to be false positives: ferm is active and rules are defined, but it looks like some rename of the check command is messed up. Looking [10:33:03] (CR) Volans: "Few ideas for improvement inline" (7 comments) [software/netbox-reports] - https://gerrit.wikimedia.org/r/477716 (https://phabricator.wikimedia.org/T205899) (owner: CRusnov) [10:33:13] PROBLEM - Check whether ferm is active by checking the default input chain on elastic2049 is CRITICAL: NRPE: Command check_ferm_active not defined [10:33:13] PROBLEM - Check whether ferm is active by checking the default input chain on elastic2051 is CRITICAL: NRPE: Command check_ferm_active not defined [10:34:30] (PS1) Alexandros Kosiaris: Rake: Support ignoring upstream modules [puppet] - https://gerrit.wikimedia.org/r/477754 [10:34:42] ACKNOWLEDGEMENT - Check size of conntrack table on elastic2045 is CRITICAL: NRPE: Command check_conntrack_table_size not defined Gehel new servers, failing check under investigation [10:34:42] ACKNOWLEDGEMENT - Check whether ferm is active by checking the default input chain on elastic2045 is CRITICAL: NRPE: Command check_ferm_active not defined Gehel new servers, 
failing check under investigation [10:34:42] ACKNOWLEDGEMENT - Check size of conntrack table on elastic2046 is CRITICAL: NRPE: Command check_conntrack_table_size not defined Gehel new servers, failing check under investigation [10:34:42] ACKNOWLEDGEMENT - Check size of conntrack table on elastic2047 is CRITICAL: NRPE: Command check_conntrack_table_size not defined Gehel new servers, failing check under investigation [10:34:42] ACKNOWLEDGEMENT - Check whether ferm is active by checking the default input chain on elastic2047 is CRITICAL: NRPE: Command check_ferm_active not defined Gehel new servers, failing check under investigation [10:34:42] ACKNOWLEDGEMENT - Check whether ferm is active by checking the default input chain on elastic2048 is CRITICAL: NRPE: Command check_ferm_active not defined Gehel new servers, failing check under investigation [10:34:42] ACKNOWLEDGEMENT - Check size of conntrack table on elastic2049 is CRITICAL: NRPE: Command check_conntrack_table_size not defined Gehel new servers, failing check under investigation [10:34:43] ACKNOWLEDGEMENT - Check whether ferm is active by checking the default input chain on elastic2049 is CRITICAL: NRPE: Command check_ferm_active not defined Gehel new servers, failing check under investigation [10:34:43] ACKNOWLEDGEMENT - Check size of conntrack table on elastic2050 is CRITICAL: NRPE: Command check_conntrack_table_size not defined Gehel new servers, failing check under investigation [10:34:44] ACKNOWLEDGEMENT - Check whether ferm is active by checking the default input chain on elastic2050 is CRITICAL: NRPE: Command check_ferm_active not defined Gehel new servers, failing check under investigation [10:34:44] ACKNOWLEDGEMENT - Check size of conntrack table on elastic2051 is CRITICAL: NRPE: Command check_conntrack_table_size not defined Gehel new servers, failing check under investigation [10:34:45] ACKNOWLEDGEMENT - Check whether ferm is active by checking the default input chain on elastic2051 is CRITICAL: 
NRPE: Command check_ferm_active not defined Gehel new servers, failing check under investigation [10:34:45] ACKNOWLEDGEMENT - Check size of conntrack table on elastic2052 is CRITICAL: NRPE: Command check_conntrack_table_size not defined Gehel new servers, failing check under investigation [10:34:46] ACKNOWLEDGEMENT - Check whether ferm is active by checking the default input chain on elastic2052 is CRITICAL: NRPE: Command check_ferm_active not defined Gehel new servers, failing check under investigation [10:34:46] ACKNOWLEDGEMENT - Check size of conntrack table on elastic2053 is CRITICAL: NRPE: Command check_conntrack_table_size not defined Gehel new servers, failing check under investigation [10:34:46] (03PS1) 10Marostegui: db-eqiad.php: Depool db1090:3317 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477755 [10:34:47] ACKNOWLEDGEMENT - Check whether ferm is active by checking the default input chain on elastic2053 is CRITICAL: NRPE: Command check_ferm_active not defined Gehel new servers, failing check under investigation [10:36:14] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1090:3317 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477755 (owner: 10Marostegui) [10:36:55] PROBLEM - Check whether ferm is active by checking the default input chain on elastic2046 is CRITICAL: NRPE: Command check_ferm_active not defined [10:37:15] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1090:3317 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477755 (owner: 10Marostegui) [10:38:15] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1090:3317 (duration: 00m 46s) [10:38:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:39:13] PROBLEM - Check size of conntrack table on elastic2054 is CRITICAL: NRPE: Command check_conntrack_table_size not defined [10:39:16] (03PS1) 10Marostegui: db-eqiad.php: Depool db1090:3312 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477756 [10:40:19] (03CR) 
10Marostegui: [C: 032] db-eqiad.php: Depool db1090:3312 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477756 (owner: 10Marostegui) [10:40:22] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1090:3317 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477755 (owner: 10Marostegui) [10:40:25] 10Operations, 10ops-codfw, 10netops: upgrade all codfw switch stacks to include additional 10G switch per row - https://phabricator.wikimedia.org/T196489 (10elukey) [10:40:47] PROBLEM - Check size of conntrack table on elastic2048 is CRITICAL: NRPE: Command check_conntrack_table_size not defined [10:41:07] PROBLEM - Check whether ferm is active by checking the default input chain on elastic2054 is CRITICAL: NRPE: Command check_ferm_active not defined [10:41:21] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1090:3312 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477756 (owner: 10Marostegui) [10:41:58] jouncebot: next [10:41:58] In 1 hour(s) and 18 minute(s): European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181205T1200) [10:42:25] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1090:3312 (duration: 00m 46s) [10:42:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:42:29] !log Stop MySQL on db1090:3312 and db1090:3317 for MySQL upgrade [10:42:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:43:13] RECOVERY - Long running screen/tmux on notebook1003 is OK: OK: Tmux detected but not long running. 
[10:44:31] RECOVERY - Check whether ferm is active by checking the default input chain on elastic2052 is OK: OK ferm input default policy is set [10:44:35] (CR) Mobrovac: [C: -1] service::node: add the 'use_nodejs10' parameter (1 comment) [puppet] - https://gerrit.wikimedia.org/r/477475 (https://phabricator.wikimedia.org/T210704) (owner: Elukey) [10:44:43] !log uploaded jenkins 2.138.4 to jessie-wikimedia/thirdparty and stretch-wikimedia/thirdparty/ci [10:44:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:45:02] Operations, ops-codfw, netops: codfw row B recable and add QFX - https://phabricator.wikimedia.org/T210456 (elukey) No mcrouter codfw proxies present in B4, all good. [10:45:09] RECOVERY - Check whether ferm is active by checking the default input chain on elastic2048 is OK: OK ferm input default policy is set [10:45:09] RECOVERY - Check size of conntrack table on elastic2050 is OK: OK: nf_conntrack is 0 % full [10:45:21] RECOVERY - Check whether ferm is active by checking the default input chain on elastic2045 is OK: OK ferm input default policy is set [10:45:21] RECOVERY - Check whether ferm is active by checking the default input chain on elastic2047 is OK: OK ferm input default policy is set [10:45:21] RECOVERY - Check whether ferm is active by checking the default input chain on elastic2051 is OK: OK ferm input default policy is set [10:45:21] RECOVERY - Check whether ferm is active by checking the default input chain on elastic2049 is OK: OK ferm input default policy is set [10:45:21] RECOVERY - Check size of conntrack table on elastic2048 is OK: OK: nf_conntrack is 0 % full [10:45:29] RECOVERY - Check whether ferm is active by checking the default input chain on elastic2050 is OK: OK ferm input default policy is set [10:45:43] RECOVERY - Check whether ferm is active by checking the default input chain on elastic2054 is OK: OK ferm input default policy is set [10:45:59] RECOVERY - Check size of 
conntrack table on elastic2053 is OK: OK: nf_conntrack is 0 % full [10:46:01] RECOVERY - Check whether ferm is active by checking the default input chain on elastic2046 is OK: OK ferm input default policy is set [10:46:01] RECOVERY - Check size of conntrack table on elastic2047 is OK: OK: nf_conntrack is 0 % full [10:46:01] RECOVERY - Check size of conntrack table on elastic2045 is OK: OK: nf_conntrack is 0 % full [10:46:07] RECOVERY - Check size of conntrack table on elastic2054 is OK: OK: nf_conntrack is 0 % full [10:46:09] 10Operations, 10ops-codfw, 10netops: codfw row A recable and add QFX - https://phabricator.wikimedia.org/T210447 (10elukey) No mcrouter proxies on A4, all good. [10:46:09] RECOVERY - Check size of conntrack table on elastic2051 is OK: OK: nf_conntrack is 0 % full [10:46:09] RECOVERY - Check size of conntrack table on elastic2049 is OK: OK: nf_conntrack is 0 % full [10:46:21] !log Reboot db1090 for kernel upgrade [10:46:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:49:38] RECOVERY - Check size of conntrack table on elastic2052 is OK: OK: nf_conntrack is 0 % full [10:51:55] !log cache hosts: begin nginx rolling upgrade to 1.13.6-2+wmf2 [10:51:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:53:23] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1090:3317" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477758 [10:53:28] (03PS2) 10Marostegui: Revert "db-eqiad.php: Depool db1090:3317" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477758 [10:53:33] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1090:3312 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477756 (owner: 10Marostegui) [10:54:30] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1090:3317" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477758 (owner: 10Marostegui) [10:55:31] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1090:3317" [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/477758 (owner: 10Marostegui) [10:56:11] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1090:3312" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477759 [10:56:15] (03PS2) 10Marostegui: Revert "db-eqiad.php: Depool db1090:3312" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477759 [10:56:38] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1090:3317 (duration: 00m 45s) [10:56:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:57:30] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1090:3312" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477759 (owner: 10Marostegui) [10:58:22] (03CR) 10Volans: [C: 031] "LGTM, not merging right now as I don't know the status of the related code in prod, but feel free to ping if you need it merged." [puppet] - 10https://gerrit.wikimedia.org/r/475579 (https://phabricator.wikimedia.org/T210312) (owner: 10Legoktm) [10:58:36] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1090:3312" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477759 (owner: 10Marostegui) [10:59:33] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1090:3312 (duration: 00m 45s) [10:59:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:02:04] RECOVERY - cassandra-a CQL 10.192.48.121:9042 on restbase2017 is OK: TCP OK - 0.036 second response time on 10.192.48.121 port 9042 [11:03:02] 10Operations, 10MediaWiki-Logging, 10Wikimedia-Logstash: Move mediawiki to new logging infrastructure - https://phabricator.wikimedia.org/T211124 (10fgiunchedi) [11:03:39] 10Operations, 10Wikimedia-Logstash, 10service-runner, 10Core Platform Team Backlog (Next), 10Services (next): Move service-runner to new logging infrastructure - https://phabricator.wikimedia.org/T211125 (10fgiunchedi) [11:04:50] !log bootstrap cassandra-b, restbase2017 - T210843 [11:04:55] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:04:59] T210843: Reshape RESTBase Cassandra cluster for server refresh - https://phabricator.wikimedia.org/T210843 [11:05:42] RECOVERY - cassandra-b service on restbase2017 is OK: OK - cassandra-b is active [11:05:50] RECOVERY - cassandra-b SSL 10.192.48.122:7001 on restbase2017 is OK: SSL OK - Certificate restbase2017-b valid until 2020-11-29 09:26:18 +0000 (expires in 724 days) [11:05:51] (03PS1) 10Mathew.onipe: elasticsearch: Remove elastic2001-elastic2024 from codfw cluster [puppet] - 10https://gerrit.wikimedia.org/r/477761 (https://phabricator.wikimedia.org/T211023) [11:05:58] !log upgrade kubernetes-client and kubernetes-master on staging to 1.10.11 [11:06:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:06:59] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1090:3317" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477758 (owner: 10Marostegui) [11:07:01] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1090:3312" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477759 (owner: 10Marostegui) [11:19:02] (03CR) 10Arturo Borrero Gonzalez: "We should clarify first which server to use to store metrics." [puppet] - 10https://gerrit.wikimedia.org/r/477620 (https://phabricator.wikimedia.org/T147326) (owner: 10Cwhite) [11:21:20] 10Operations, 10Research-Programs, 10SRE-Access-Requests, 10Epic, 10Patch-For-Review: access to analytics-privatedata-users for @toddleroux, @Afandian, & @RyanSteinberg - https://phabricator.wikimedia.org/T209298 (10Afandian) It appears I supplied a key in the wrong format. I believe this is preventing m... 
[11:22:03] (03CR) 10Volans: "Replies inline" (036 comments) [debs/prometheus-icinga-exporter] - 10https://gerrit.wikimedia.org/r/471298 (https://phabricator.wikimedia.org/T208066) (owner: 10Cwhite) [11:23:02] !log mobrovac@deploy1001 Started deploy [citoid/deploy@b10e034]: Truncate Zotero-reported time stamp to date - T211127 [11:23:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:23:06] T211127: The new translation-server returns access date with the full time stamp; we should strip this - https://phabricator.wikimedia.org/T211127 [11:25:16] (03PS1) 10Ema: ATS: do not add Client-IP and Via to request headers [puppet] - 10https://gerrit.wikimedia.org/r/477763 (https://phabricator.wikimedia.org/T207048) [11:27:40] (03CR) 10Elukey: service::node: add the 'use_nodejs10' parameter (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/477475 (https://phabricator.wikimedia.org/T210704) (owner: 10Elukey) [11:28:58] !log mobrovac@deploy1001 Finished deploy [citoid/deploy@b10e034]: Truncate Zotero-reported time stamp to date - T211127 (duration: 05m 55s) [11:28:59] (03CR) 10Elukey: service::node: add the 'use_nodejs10' parameter (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/477475 (https://phabricator.wikimedia.org/T210704) (owner: 10Elukey) [11:29:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:29:01] T211127: The new translation-server returns access date with the full time stamp; we should strip this - https://phabricator.wikimedia.org/T211127 [11:31:09] jouncebot: next [11:31:11] In 0 hour(s) and 28 minute(s): European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181205T1200) [11:32:23] (03CR) 10Alexandros Kosiaris: [C: 031] "This should fix the error issued by jenkins at https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/475260/" [puppet] - 10https://gerrit.wikimedia.org/r/477754 (owner: 10Alexandros Kosiaris) [11:32:34] 
10Operations, 10Citoid, 10Regression, 10VisualEditor (Current work): Some regressions in production with Zotero translation-server in production at all - https://phabricator.wikimedia.org/T211114 (10mobrovac) [11:32:44] 10Operations, 10Citoid, 10Regression, 10VisualEditor (Current work): The new translation-server returns access date with the full time stamp; we should strip this - https://phabricator.wikimedia.org/T211127 (10mobrovac) 05Open>03Resolved p:05Unbreak!>03High a:03Mvolz Deployed, should be fixed now... [11:34:00] (03PS5) 10Elukey: service::node: add the 'use_nodejs10' parameter [puppet] - 10https://gerrit.wikimedia.org/r/477475 (https://phabricator.wikimedia.org/T210704) [11:38:02] (03CR) 10Muehlenhoff: service::node: add the 'use_nodejs10' parameter (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/477475 (https://phabricator.wikimedia.org/T210704) (owner: 10Elukey) [11:40:34] (03CR) 10Muehlenhoff: [C: 031] service::node: add the 'use_nodejs10' parameter (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/477475 (https://phabricator.wikimedia.org/T210704) (owner: 10Elukey) [11:41:30] 10Operations, 10Citoid, 10Patch-For-Review, 10Services (done), 10VisualEditor (Current work): Transition citoid to use Zotero's translation-server-v2 - https://phabricator.wikimedia.org/T197242 (10akosiaris) >>! In T197242#4798702, @Pchelolo wrote: > Seems like after this has been done the citation alert... 
[11:42:03] (03PS6) 10Elukey: service::node: add the 'use_nodejs10' parameter [puppet] - 10https://gerrit.wikimedia.org/r/477475 (https://phabricator.wikimedia.org/T210704) [11:47:46] (03CR) 10Giuseppe Lavagetto: [C: 04-1] Rake: Support ignoring upstream modules (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/477754 (owner: 10Alexandros Kosiaris) [11:51:22] 10Operations, 10Commons, 10MediaWiki-File-management, 10MediaWiki-Maintenance-scripts, and 3 others: cronspam cleanup: Cron /usr/local/bin/foreachwiki maintenance/cleanupUploadStash.php > /dev/null - https://phabricator.wikimedia.org/T150375 (10Aklapper) @Joe: Could you please reply to t... [11:52:12] (03PS1) 10Ema: ATS: add settings for RAM Cache [puppet] - 10https://gerrit.wikimedia.org/r/477767 (https://phabricator.wikimedia.org/T207048) [11:56:42] (03CR) 10Ema: [C: 032] ATS: do not add Client-IP and Via to request headers [puppet] - 10https://gerrit.wikimedia.org/r/477763 (https://phabricator.wikimedia.org/T207048) (owner: 10Ema) [11:56:56] (03CR) 10Ema: [C: 032] ATS: add settings for RAM Cache [puppet] - 10https://gerrit.wikimedia.org/r/477767 (https://phabricator.wikimedia.org/T207048) (owner: 10Ema) [11:59:46] (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477768 (https://phabricator.wikimedia.org/T128546) [12:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Dear deployers, time to do the European Mid-day SWAT(Max 6 patches) deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181205T1200). [12:00:04] dcausse, jan_drewniak, Urbanecm, and Hauskatze: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [12:00:13] meow! 
[12:00:15] o/ [12:00:18] * Urbanecm waves [12:00:32] o/ [12:01:00] I'd appreciate moving my patch to the start of SWAT, I have only a few minutes :( [12:01:12] I can SWAT (except the portal patch) [12:01:22] jan_drewniak: can you take care of the deploy? [12:01:37] dcausse: yup! I can do my own [12:02:00] Urbanecm: swating your patch [12:02:06] thanks [12:02:08] !log depooling mathoid and citoid servers on codfw for k8s upgrade [12:02:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:03:23] (PS2) Banyek: mariadb: depool db1082 [mediawiki-config] - https://gerrit.wikimedia.org/r/477588 (https://phabricator.wikimedia.org/T85757) [12:03:26] (CR) DCausse: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/477741 (https://phabricator.wikimedia.org/T205546) (owner: Urbanecm) [12:04:28] (Merged) jenkins-bot: Define 2 new namespaces for yuewiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/477741 (https://phabricator.wikimedia.org/T205546) (owner: Urbanecm) [12:04:46] (PS2) Banyek: mariadb: depool db1110 [mediawiki-config] - https://gerrit.wikimedia.org/r/477592 (https://phabricator.wikimedia.org/T85757) [12:07:44] dcausse, after deploying, please run namespaceDupes.php [12:08:26] PROBLEM - DPKG on acrab is CRITICAL: DPKG CRITICAL dpkg reports broken packages [12:08:48] Urbanecm: do you want to test on mwdebug1002? 
[12:09:13] it's live there btw [12:09:17] Looking [12:09:57] looks good to me [12:10:01] please deploy&run the script [12:10:28] sure [12:10:50] RECOVERY - DPKG on acrab is OK: All packages OK [12:11:36] !log dcausse@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T205546: Define 2 new namespaces for yuewiktionary (duration: 00m 47s) [12:11:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:40] T205546: Create Wiktionary Cantonese - https://phabricator.wikimedia.org/T205546 [12:11:40] PROBLEM - Logstash rate of ingestion percent change compared to yesterday on icinga1001 is CRITICAL: 131 ge 130 https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen [12:11:59] Gotta leave now, sorry. Thanks a lot! [12:12:29] Urbanecm: sure, yw! [12:13:32] (CR) jenkins-bot: Define 2 new namespaces for yuewiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/477741 (https://phabricator.wikimedia.org/T205546) (owner: Urbanecm) [12:15:22] !log running namespaceDupes & cirrus indexNamespaces on yuewiktionary [12:15:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:15:40] !log upgrading codfw k8s cluster to 1.10.11 [12:15:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:16:07] (CR) Mobrovac: [C: 1] "PCC is now happy and so am I - https://puppet-compiler.wmflabs.org/compiler1002/13844/" [puppet] - https://gerrit.wikimedia.org/r/477475 (https://phabricator.wikimedia.org/T210704) (owner: Elukey) [12:16:24] Hauskatze: around? 
[12:16:36] yup dcausse [12:16:41] my patch cannot be tested [12:16:43] (03CR) 10DCausse: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477747 (https://phabricator.wikimedia.org/T211188) (owner: 10MarcoAurelio) [12:16:47] so you can deploy right away [12:16:47] sure [12:19:10] (03PS2) 10DCausse: Increase autoconfirmed count for Meta-Wiki to 5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477747 (https://phabricator.wikimedia.org/T211188) (owner: 10MarcoAurelio) [12:20:11] (03CR) 10DCausse: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477747 (https://phabricator.wikimedia.org/T211188) (owner: 10MarcoAurelio) [12:21:40] (03Merged) 10jenkins-bot: Increase autoconfirmed count for Meta-Wiki to 5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477747 (https://phabricator.wikimedia.org/T211188) (owner: 10MarcoAurelio) [12:23:28] !log dcausse@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T211188: Increase autoconfirmed count for Meta-Wiki to 5 (duration: 00m 47s) [12:23:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:23:31] T211188: Increase autoconfirmed threshold for Meta-Wiki - https://phabricator.wikimedia.org/T211188 [12:23:37] Hauskatze: done ^ [12:23:44] :D [12:23:45] thanks [12:23:48] yw! :) [12:23:50] will report on wiki [12:24:11] jan_drewniak: please go ahead, I'll do mine after, it's a bit long to test properly [12:24:21] dcausse: sure thing! [12:24:27] (03CR) 10Jdrewniak: [C: 032] Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477768 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [12:24:30] PROBLEM - Check systemd state on acrux is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[12:25:52] (03Merged) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477768 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [12:27:23] pcoombe: it's live on mwdebug1002 if you wanna see [12:27:24] (03CR) 10jenkins-bot: Increase autoconfirmed count for Meta-Wiki to 5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477747 (https://phabricator.wikimedia.org/T211188) (owner: 10MarcoAurelio) [12:27:26] (03CR) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477768 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [12:28:01] alright! it's getting deployed [12:29:22] RECOVERY - Check systemd state on acrux is OK: OK - running: The system is fully operational [12:29:55] !log jdrewniak@deploy1001 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:477768| Bumping portals to master (T202497)]] (duration: 00m 49s) [12:29:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:29:58] T202497: Add fundraising appeal on Wikipedia portal page - https://phabricator.wikimedia.org/T202497 [12:30:41] !log jdrewniak@deploy1001 Synchronized portals: Wikimedia Portals Update: [[gerrit:477768| Bumping portals to master (T202497)]] (duration: 00m 46s) [12:30:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:31:28] !log pooling mathoid and citoid again on codfw [12:31:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:31:39] boom! it's done. dcausse, it's all yours [12:31:44] jan_drewniak: thanks! 
[12:32:36] (03CR) 10DCausse: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475750 (https://phabricator.wikimedia.org/T210381) (owner: 10DCausse) [12:33:34] 10Operations, 10Traffic, 10netops: IPv6 ~20ms higher ping than IPv4 to gerrit - https://phabricator.wikimedia.org/T211079 (10faidon) Some thoughts here: It would be ideal to differentiate between peering routes & transit routes in our HE peering and mark those appropriately (with our peering communities, lo... [12:33:38] (03Merged) 10jenkins-bot: [cirrus] Add temp clusters but still write to the old ones [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475750 (https://phabricator.wikimedia.org/T210381) (owner: 10DCausse) [12:34:24] 10Operations, 10netops, 10Patch-For-Review, 10cloud-services-team (Kanban): Renumber cloud-instance-transport1-b-eqiad to public IPs - https://phabricator.wikimedia.org/T207663 (10aborrero) a:03aborrero [12:34:36] PROBLEM - restbase endpoints health on restbase2004 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [12:34:38] PROBLEM - Restbase edge codfw on text-lb.codfw.wikimedia.org is CRITICAL: /api/rest_v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [12:35:16] PROBLEM - restbase endpoints health on restbase2017 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [12:35:18] PROBLEM - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is CRITICAL: /api/rest_v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [12:35:31] PROBLEM - restbase endpoints health on restbase2005 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [12:35:58] PROBLEM - restbase endpoints health on restbase2003 is 
CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [12:36:00] PROBLEM - restbase endpoints health on restbase2011 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [12:36:00] PROBLEM - restbase endpoints health on restbase2010 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [12:36:04] PROBLEM - restbase endpoints health on restbase2002 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [12:36:06] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [12:36:28] PROBLEM - restbase endpoints health on restbase2009 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [12:36:28] PROBLEM - restbase endpoints health on restbase2013 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [12:36:36] PROBLEM - restbase endpoints health on restbase2012 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [12:36:44] PROBLEM - restbase endpoints health on restbase2018 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [12:36:44] PROBLEM - restbase endpoints health on restbase2015 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [12:36:50] PROBLEM - Restbase edge ulsfo on 
text-lb.ulsfo.wikimedia.org is CRITICAL: /api/rest_v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [12:37:04] PROBLEM - restbase endpoints health on restbase2008 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [12:37:06] PROBLEM - restbase endpoints health on restbase2014 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [12:37:07] o_O [12:37:20] PROBLEM - restbase endpoints health on restbase2007 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [12:37:20] PROBLEM - restbase endpoints health on restbase2001 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [12:37:42] PROBLEM - restbase endpoints health on restbase2016 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [12:37:44] PROBLEM - restbase endpoints health on restbase2006 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [12:38:17] (03CR) 10Mark Bergsma: [C: 032] Modernize and cleanup Coordinator [debs/pybal] - 10https://gerrit.wikimedia.org/r/447775 (owner: 10Mark Bergsma) [12:39:25] (03Merged) 10jenkins-bot: Modernize and cleanup Coordinator [debs/pybal] - 10https://gerrit.wikimedia.org/r/447775 (owner: 10Mark Bergsma) [12:40:59] (03CR) 10jenkins-bot: [cirrus] Add temp clusters but still write to the old ones [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475750 (https://phabricator.wikimedia.org/T210381) (owner: 10DCausse) [12:42:47] !log dcausse@deploy1001 Synchronized wmf-config/CommonSettings.php: T210381: [cirrus] Add temp 
clusters but still write to the old ones 1/2 (duration: 00m 46s) [12:42:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:42:55] T210381: Update mw-config to use the psi&omega elastic clusters - https://phabricator.wikimedia.org/T210381 [12:42:59] for Darth Vader indeed [12:44:39] (03PS1) 10Arturo Borrero Gonzalez: cloudvps: labtestn: introduce new physical net mapping [puppet] - 10https://gerrit.wikimedia.org/r/477769 (https://phabricator.wikimedia.org/T207663) [12:44:47] !log dcausse@deploy1001 Synchronized wmf-config/CirrusSearch-production.php: T210381: [cirrus] Add temp clusters but still write to the old ones 2/2 (duration: 00m 46s) [12:44:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:45:27] !log oblivian@puppetmaster1001 conftool action : set/pooled=false; selector: dnsdisc=(cit|math)oid,name=codfw [12:45:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:47:04] RECOVERY - restbase endpoints health on restbase2011 is OK: All endpoints are healthy [12:47:04] RECOVERY - restbase endpoints health on restbase2002 is OK: All endpoints are healthy [12:47:06] RECOVERY - restbase endpoints health on restbase2001 is OK: All endpoints are healthy [12:47:06] RECOVERY - restbase endpoints health on restbase2010 is OK: All endpoints are healthy [12:47:12] <_joe_> there you go [12:47:28] RECOVERY - restbase endpoints health on restbase2009 is OK: All endpoints are healthy [12:47:28] RECOVERY - restbase endpoints health on restbase2006 is OK: All endpoints are healthy [12:47:32] RECOVERY - restbase endpoints health on restbase2012 is OK: All endpoints are healthy [12:47:42] RECOVERY - restbase endpoints health on restbase2018 is OK: All endpoints are healthy [12:47:50] !log EU SWAT done [12:47:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:48:04] RECOVERY - Restbase edge codfw on text-lb.codfw.wikimedia.org is OK: All endpoints are 
healthy [12:48:06] RECOVERY - restbase endpoints health on restbase2008 is OK: All endpoints are healthy [12:48:10] RECOVERY - restbase endpoints health on restbase2003 is OK: All endpoints are healthy [12:48:22] RECOVERY - restbase endpoints health on restbase2007 is OK: All endpoints are healthy [12:48:36] RECOVERY - restbase endpoints health on restbase2016 is OK: All endpoints are healthy [12:48:40] RECOVERY - restbase endpoints health on restbase2017 is OK: All endpoints are healthy [12:48:40] RECOVERY - restbase endpoints health on restbase2013 is OK: All endpoints are healthy [12:48:44] RECOVERY - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is OK: All endpoints are healthy [12:48:56] RECOVERY - restbase endpoints health on restbase2015 is OK: All endpoints are healthy [12:49:00] RECOVERY - restbase endpoints health on restbase2005 is OK: All endpoints are healthy [12:49:00] RECOVERY - Restbase edge ulsfo on text-lb.ulsfo.wikimedia.org is OK: All endpoints are healthy [12:49:18] RECOVERY - restbase endpoints health on restbase2014 is OK: All endpoints are healthy [12:49:30] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy [12:50:26] RECOVERY - restbase endpoints health on restbase2004 is OK: All endpoints are healthy [12:51:40] (03CR) 10Arturo Borrero Gonzalez: [C: 032] "https://puppet-compiler.wmflabs.org/compiler1002/13845/" [puppet] - 10https://gerrit.wikimedia.org/r/477769 (https://phabricator.wikimedia.org/T207663) (owner: 10Arturo Borrero Gonzalez) [12:53:55] !log uploaded python-thumbor-community-core_0.4.0-1+deb9u1 to stretch-wikimedia [12:53:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:55:37] 10Operations, 10netops, 10Patch-For-Review, 10cloud-services-team (Kanban): Renumber cloud-instance-transport1-b-eqiad to public IPs - https://phabricator.wikimedia.org/T207663 (10aborrero) `lang=shell-session root@labtestcontrol2003:~# neutron net-create 'wan-transport-codfw' 
--router:external=true --prov... [12:58:44] 10Operations, 10netops, 10Patch-For-Review, 10cloud-services-team (Kanban): Renumber cloud-instance-transport1-b-eqiad to public IPs - https://phabricator.wikimedia.org/T207663 (10aborrero) `lang=shell-session root@labtestcontrol2003:~# neutron subnet-create --gateway 208.80.153.185 --name cloud-instances-... [12:59:54] !log banning elastic2001-elastic2024 from codfw production, psi and omega clusters [12:59:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:06:52] (03PS3) 10Ladsgroup: Add shnwiki to InterwikiSortOrders.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473510 (https://phabricator.wikimedia.org/T206777) (owner: 10Reedy) [13:06:54] (03PS1) 10Ladsgroup: Add yue to InterwikiSortOrders.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477771 (https://phabricator.wikimedia.org/T209820) [13:12:54] PROBLEM - Apache HTTP on mw1343 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.001 second response time [13:14:08] RECOVERY - Apache HTTP on mw1343 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 618 bytes in 2.217 second response time [13:17:35] 10Operations, 10netops, 10Patch-For-Review, 10cloud-services-team (Kanban): Renumber cloud-instance-transport1-b-eqiad to public IPs - https://phabricator.wikimedia.org/T207663 (10aborrero) `lang=shell-session root@labtestcontrol2003:~# neutron router-gateway-set --fixed-ip subnet_id=cloud-instances-transp... 
[13:31:22] 10Operations, 10ORES, 10Scoring-platform-team, 10Release Pipeline (Blubber), 10Release-Engineering-Team (Backlog): Blubber should be able to make multi docker files per repo - https://phabricator.wikimedia.org/T210267 (10Ladsgroup) p:05Normal>03Triage [13:32:12] 10Operations, 10ORES, 10Scoring-platform-team, 10Release Pipeline (Blubber), 10Release-Engineering-Team (Backlog): Blubber should be able to make multi docker files per repo - https://phabricator.wikimedia.org/T210267 (10Ladsgroup) p:05Triage>03Normal I didn't change the priority. [13:41:09] (03CR) 10Banyek: mariadb: depool db1113:3315 (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477593 (https://phabricator.wikimedia.org/T85757) (owner: 10Banyek) [13:41:21] 10Operations, 10netops, 10Patch-For-Review, 10cloud-services-team (Kanban): Renumber cloud-instance-transport1-b-eqiad to public IPs - https://phabricator.wikimedia.org/T207663 (10aborrero) I will try reusing the same network object to try to make this even cleaner. 
[13:41:44] (03PS3) 10Banyek: mariadb: depool db1113:3315 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477593 (https://phabricator.wikimedia.org/T85757) [13:43:43] (03CR) 10Marostegui: [C: 031] mariadb: depool db1082 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477588 (https://phabricator.wikimedia.org/T85757) (owner: 10Banyek) [13:44:23] (03CR) 10Marostegui: [C: 04-1] mariadb: depool db1113:3315 (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477593 (https://phabricator.wikimedia.org/T85757) (owner: 10Banyek) [13:45:13] (03CR) 10Marostegui: [C: 031] mariadb: depool db1110 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477592 (https://phabricator.wikimedia.org/T85757) (owner: 10Banyek) [13:47:08] (03PS4) 10Banyek: mariadb: depool db1113:3315 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477593 (https://phabricator.wikimedia.org/T85757) [13:47:51] (03CR) 10Marostegui: [C: 031] mariadb: depool db1113:3315 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477593 (https://phabricator.wikimedia.org/T85757) (owner: 10Banyek) [13:52:14] PROBLEM - Request latencies on acrux is CRITICAL: instance=10.192.0.93:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [13:53:36] !log repool citoid/mathoid codfw [13:53:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:54:40] RECOVERY - Request latencies on acrux is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [13:58:06] (03PS1) 10Hoo man: Enable arbitrary item/property access for all wiktionaries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477777 (https://phabricator.wikimedia.org/T175273) [13:58:51] 10Operations, 10ops-codfw, 10Core Platform Team, 10Services (doing), and 2 others: Reshape RESTBase Cassandra cluster for server refresh - https://phabricator.wikimedia.org/T210843 (10Eevans) [14:00:04] hoo: That opportune time is upon us again. 
Time for a Wikidata deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181205T1400). [14:01:29] (03CR) 10Hoo man: [C: 032] Enable arbitrary item/property access for all wiktionaries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477777 (https://phabricator.wikimedia.org/T175273) (owner: 10Hoo man) [14:03:01] (03Merged) 10jenkins-bot: Enable arbitrary item/property access for all wiktionaries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477777 (https://phabricator.wikimedia.org/T175273) (owner: 10Hoo man) [14:06:52] 10Operations, 10Core Platform Team Backlog (Watching / External), 10HHVM, 10TechCom-RFC (TechCom-Approved), 10User-ArielGlenn: Correctly collect logs from php-fpm pools - https://phabricator.wikimedia.org/T211184 (10fgiunchedi) I took a quick look at this as well and indeed `openlog()` seems the simplest... [14:09:08] !log hoo@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Enable arbitrary item/property access for all wiktionaries (T175273) (duration: 00m 47s) [14:09:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:11] T175273: Enable arbitrary access on other Wiktionaries - https://phabricator.wikimedia.org/T175273 [14:11:21] (03PS3) 10Elukey: Add the Hadoop worker nodes' racking awareness config [puppet] - 10https://gerrit.wikimedia.org/r/474904 (https://phabricator.wikimedia.org/T209929) [14:12:18] (03CR) 10Elukey: [C: 032] Add the Hadoop worker nodes' racking awareness config [puppet] - 10https://gerrit.wikimedia.org/r/474904 (https://phabricator.wikimedia.org/T209929) (owner: 10Elukey) [14:13:01] (03CR) 10jenkins-bot: Enable arbitrary item/property access for all wiktionaries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477777 (https://phabricator.wikimedia.org/T175273) (owner: 10Hoo man) [14:14:48] (03PS1) 10Elukey: Assign analytics_cluster::hadoop::worker to an-worker* [puppet] - 10https://gerrit.wikimedia.org/r/477779 
(https://phabricator.wikimedia.org/T209929) [14:16:01] 10Operations, 10Toolforge, 10cloud-services-team (Kanban): tools-k8s-master-01 has two floating IPs - https://phabricator.wikimedia.org/T164123 (10GTirloni) Incident caused by removing the DNS entry: https://wikitech.wikimedia.org/wiki/Incident_documentation/20181204-Toolforge-Kubernetes [14:20:50] (03CR) 10Filippo Giunchedi: "For more context, this is a "retry" of a previous patch that got reverted due to ferm + missing AAAA for labmon (now fixed): https://gerri" [puppet] - 10https://gerrit.wikimedia.org/r/477620 (https://phabricator.wikimedia.org/T147326) (owner: 10Cwhite) [14:21:08] (03CR) 10Filippo Giunchedi: [C: 031] wmcs: add prometheus-memcached-exporter [puppet] - 10https://gerrit.wikimedia.org/r/477620 (https://phabricator.wikimedia.org/T147326) (owner: 10Cwhite) [14:22:13] godog: \o/ [14:24:03] heheh indeed! [14:43:02] !log Running cleanupUsersWithNoId.php on metawiki for T181731 / T210985 [14:43:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:07] T181731: Run maintenance/cleanupUsersWithNoId.php on all wikis - https://phabricator.wikimedia.org/T181731 [14:43:07] T210985: Fatal exception of type "InvalidArgumentException" when querying for Special:Contributions - https://phabricator.wikimedia.org/T210985 [14:48:01] (03CR) 10Ottomata: [C: 031] Assign analytics_cluster::hadoop::worker to an-worker* [puppet] - 10https://gerrit.wikimedia.org/r/477779 (https://phabricator.wikimedia.org/T209929) (owner: 10Elukey) [14:48:49] (03CR) 10Alexandros Kosiaris: [C: 04-1] "I 've set it on ores1001 for a while and this was the result" [puppet] - 10https://gerrit.wikimedia.org/r/477302 (https://phabricator.wikimedia.org/T206333) (owner: 10Ladsgroup) [14:49:39] (03PS1) 10Kosta Harlan: Enable Help Panel on beta labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477784 (https://phabricator.wikimedia.org/T211206) [14:51:26] PROBLEM - Restbase LVS eqiad on 
restbase.svc.eqiad.wmnet is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [14:51:40] PROBLEM - Restbase edge esams on text-lb.esams.wikimedia.org is CRITICAL: /api/rest_v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [14:51:46] !log depool citoid/eqiad: pooled changed True => False [14:51:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:56] !log depool mathoid/eqiad: pooled changed True => False [14:51:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:04] PROBLEM - Restbase edge eqiad on text-lb.eqiad.wikimedia.org is CRITICAL: /api/rest_v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [14:53:44] RECOVERY - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is OK: All endpoints are healthy [14:53:50] <_joe_> see fsero ^^ [14:53:54] <_joe_> cache expired :) [14:54:06] (03CR) 10Andrew Bogott: "+1 for the prometheus additions. 
I'm unclear about the labmon1001 bits though; for things running on physical hardware I'd expect metrics" [puppet] - 10https://gerrit.wikimedia.org/r/477620 (https://phabricator.wikimedia.org/T147326) (owner: 10Cwhite) [14:54:30] !log restart HDFS namenode and Yarn resource manager on an-master100[1,2] to update rack topology config - T209929 [14:54:32] great _joe_ continuing with the next thing [14:54:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:34] T209929: Decommission old Hadoop worker nodes and add newer ones - https://phabricator.wikimedia.org/T209929 [14:54:57] !log upgrading k8s on eqiad to 1.10.11 [14:54:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:02] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Scrapes sample page) timed out before a response was received [14:55:16] RECOVERY - Restbase edge esams on text-lb.esams.wikimedia.org is OK: All endpoints are healthy [14:55:24] RECOVERY - Restbase edge eqiad on text-lb.eqiad.wikimedia.org is OK: All endpoints are healthy [14:56:27] !log executing schema change on s5 codfw master replication lag could be expected - T85757 [14:56:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:30] T85757: Dropping user.user_options on wmf databases - https://phabricator.wikimedia.org/T85757 [14:57:49] 10Operations, 10Math, 10Patch-For-Review: Clean up artifacts from LaTeX based math rendering - https://phabricator.wikimedia.org/T195847 (10Physikerwelt) [14:57:55] 10Operations, 10Math, 10MW-1.33-notes (1.33.0-wmf.8; 2018-12-11), 10Patch-For-Review: Remove unused i18n messages for math extension - https://phabricator.wikimedia.org/T210948 (10Physikerwelt) 05Open>03Resolved [14:58:16] 10Operations, 10Math, 10Patch-For-Review: Clean up artifacts from LaTeX based math rendering - https://phabricator.wikimedia.org/T195847 (10Physikerwelt) [14:58:32] RECOVERY - Citoid LVS eqiad on 
citoid.svc.eqiad.wmnet is OK: All endpoints are healthy [14:59:58] PROBLEM - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [15:00:06] PROBLEM - Restbase edge codfw on text-lb.codfw.wikimedia.org is CRITICAL: /api/rest_v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [15:00:12] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [15:00:38] PROBLEM - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is CRITICAL: /api/rest_v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [15:01:06] RECOVERY - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is OK: All endpoints are healthy [15:01:12] RECOVERY - Restbase edge codfw on text-lb.codfw.wikimedia.org is OK: All endpoints are healthy [15:01:18] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy [15:02:20] RECOVERY - cassandra-b CQL 10.192.48.122:9042 on restbase2017 is OK: TCP OK - 0.036 second response time on 10.192.48.122 port 9042 [15:03:09] (03PS6) 10Mark Bergsma: Don't depool pooledDownServers in refreshPreexistingServer [debs/pybal] - 10https://gerrit.wikimedia.org/r/447769 (https://phabricator.wikimedia.org/T184715) [15:03:11] (03PS5) 10Mark Bergsma: Ensure that depool threshold is being honored on new/updated configs [debs/pybal] - 10https://gerrit.wikimedia.org/r/443967 (https://phabricator.wikimedia.org/T184715) (owner: 10Vgutierrez) [15:03:13] (03PS1) 10Mark Bergsma: Wait for onConfigUpdate initialization in setServers using inlineCallbacks [debs/pybal] - 10https://gerrit.wikimedia.org/r/477793 [15:03:38] !log bootstrap cassandra-c, restbase2017 - T210843 [15:03:41] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:41] T210843: Reshape RESTBase Cassandra cluster for server refresh - https://phabricator.wikimedia.org/T210843 [15:03:42] !log repooling citoid mathoid eqiad [15:03:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:14] RECOVERY - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is OK: All endpoints are healthy [15:04:30] RECOVERY - cassandra-c service on restbase2017 is OK: OK - cassandra-c is active [15:04:36] RECOVERY - Check systemd state on restbase2017 is OK: OK - running: The system is fully operational [15:04:46] RECOVERY - cassandra-c SSL 10.192.48.123:7001 on restbase2017 is OK: SSL OK - Certificate restbase2017-c valid until 2020-11-29 09:26:19 +0000 (expires in 724 days) [15:07:03] !log Running cleanupUsersWithNoId.php on potentially missed s3 and s7 wikis for T181731 [15:07:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:07:06] T181731: Run maintenance/cleanupUsersWithNoId.php on all wikis - https://phabricator.wikimedia.org/T181731 [15:08:14] (03PS2) 10Elukey: Assign analytics_cluster::hadoop::worker to an-worker* [puppet] - 10https://gerrit.wikimedia.org/r/477779 (https://phabricator.wikimedia.org/T209929) [15:10:56] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Scrapes sample page) timed out before a response was received [15:11:18] PROBLEM - Restbase edge esams on text-lb.esams.wikimedia.org is CRITICAL: /api/rest_v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [15:12:00] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy [15:12:24] RECOVERY - Restbase edge esams on text-lb.esams.wikimedia.org is OK: All endpoints are healthy [15:12:37] (03PS2) 10Mark Bergsma: Wait for onConfigUpdate initialization in setServers using inlineCallbacks [debs/pybal] - 10https://gerrit.wikimedia.org/r/477793 [15:12:39] 
(03PS6) 10Mark Bergsma: Ensure that depool threshold is being honored on new/updated configs [debs/pybal] - 10https://gerrit.wikimedia.org/r/443967 (https://phabricator.wikimedia.org/T184715) (owner: 10Vgutierrez) [15:12:41] (03PS1) 10Mark Bergsma: Call _updateServerMetrics from _serverInitDone [debs/pybal] - 10https://gerrit.wikimedia.org/r/477794 [15:13:26] PROBLEM - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [15:14:34] RECOVERY - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is OK: All endpoints are healthy [15:14:41] (03CR) 10Elukey: [C: 032] Assign analytics_cluster::hadoop::worker to an-worker* [puppet] - 10https://gerrit.wikimedia.org/r/477779 (https://phabricator.wikimedia.org/T209929) (owner: 10Elukey) [15:15:07] PROBLEM - Restbase edge eqiad on text-lb.eqiad.wikimedia.org is CRITICAL: /api/rest_v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [15:16:10] PROBLEM - Restbase edge esams on text-lb.esams.wikimedia.org is CRITICAL: /api/rest_v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [15:16:12] RECOVERY - Restbase edge eqiad on text-lb.eqiad.wikimedia.org is OK: All endpoints are healthy [15:16:13] (03PS7) 10Mark Bergsma: Ensure that depool threshold is being honored on new/updated configs [debs/pybal] - 10https://gerrit.wikimedia.org/r/443967 (https://phabricator.wikimedia.org/T184715) (owner: 10Vgutierrez) [15:16:15] (03PS2) 10Mark Bergsma: Call _updateServerMetrics from _serverInitDone [debs/pybal] - 10https://gerrit.wikimedia.org/r/477794 [15:18:02] PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [15:18:30] RECOVERY - Restbase edge esams on 
text-lb.esams.wikimedia.org is OK: All endpoints are healthy [15:20:22] RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy [15:22:30] RECOVERY - Host ms-be2047 is UP: PING OK - Packet loss = 0%, RTA = 36.12 ms [15:27:00] (03PS1) 10Gilles: Use webp -exact option on Stretch [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/477796 (https://phabricator.wikimedia.org/T170817) [15:29:12] 10Operations, 10ops-codfw: ms-be2047 spontaneous reboots - https://phabricator.wikimedia.org/T209921 (10Papaul) After 16 hours of hardware diagnostics, the server came up with no error. I have a Call schedule with Dell in 2 hours to discuss about the next step to take. {F27392018} [15:29:12] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Scrapes sample page) timed out before a response was received [15:29:40] PROBLEM - Restbase edge esams on text-lb.esams.wikimedia.org is CRITICAL: /api/rest_v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [15:30:48] RECOVERY - Restbase edge esams on text-lb.esams.wikimedia.org is OK: All endpoints are healthy [15:30:56] (03CR) 10GTirloni: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/477620 (https://phabricator.wikimedia.org/T147326) (owner: 10Cwhite) [15:31:32] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy [15:31:56] 10Operations, 10ops-codfw, 10DBA, 10decommission: Decommission parsercache hosts: pc2004 pc2005 pc2006 (Dec 2018 lease return) - https://phabricator.wikimedia.org/T209858 (10Papaul) [15:33:21] !log add back pods/portforward right to kubernetes deploy user. 
T211040 [15:33:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:33:40] PROBLEM - restbase endpoints health on restbase1008 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [15:33:56] PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [15:34:52] RECOVERY - restbase endpoints health on restbase1008 is OK: All endpoints are healthy [15:34:53] !log restarting ci jenkins for update [15:34:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:47] 10Operations, 10ops-codfw, 10DBA, 10decommission: Decommission parsercache hosts: pc2004 pc2005 pc2006 (Dec 2018 lease return) - https://phabricator.wikimedia.org/T209858 (10Papaul) @RobH any reason why we have to add the servers that we are returning to the decommission tracking Google sheet since that sh... 
[15:35:48] PROBLEM - Restbase edge esams on text-lb.esams.wikimedia.org is CRITICAL: /api/rest_v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [15:36:16] RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy [15:37:05] (03CR) 10Sbisson: [C: 04-1] Enable Help Panel on beta labs (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477784 (https://phabricator.wikimedia.org/T211206) (owner: 10Kosta Harlan) [15:37:10] PROBLEM - Restbase edge eqiad on text-lb.eqiad.wikimedia.org is CRITICAL: /api/rest_v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [15:38:10] PROBLEM - restbase endpoints health on restbase1012 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [15:38:12] RECOVERY - Restbase edge esams on text-lb.esams.wikimedia.org is OK: All endpoints are healthy [15:39:12] PROBLEM - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [15:39:18] RECOVERY - restbase endpoints health on restbase1012 is OK: All endpoints are healthy [15:40:04] PROBLEM - restbase endpoints health on restbase1007 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [15:40:16] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Scrapes sample page) timed out before a response was received [15:40:22] RECOVERY - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is OK: All endpoints are healthy [15:40:46] RECOVERY - Restbase edge eqiad on text-lb.eqiad.wikimedia.org is OK: All endpoints are healthy [15:41:20] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy [15:41:28] PROBLEM - Request 
latencies on argon is CRITICAL: instance=10.64.32.133:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [15:41:32] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: /api (Scrapes sample page) timed out before a response was received [15:43:16] is the citoid/restbase spam known/expected? can be silenced? [15:43:50] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy [15:44:12] PROBLEM - restbase endpoints health on restbase1017 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) is CRITICAL: Test Retrieve aggregated feed content for April 29, 2016 responds with malformed body (AttributeError: NoneType object has no attribute get) [15:44:18] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [15:44:24] PROBLEM - Request latencies on chlorine is CRITICAL: instance=10.64.0.45:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [15:44:30] godog: no it's an actual problem. just not sure what exactly yet [15:44:32] PROBLEM - Restbase edge eqiad on text-lb.eqiad.wikimedia.org is CRITICAL: /api/rest_v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [15:44:55] akosiaris: kk, thanks! [15:44:56] RECOVERY - restbase endpoints health on restbase1007 is OK: All endpoints are healthy [15:45:06] RECOVERY - Request latencies on argon is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/kubernetes-api [15:45:18] PROBLEM - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [15:45:18] (03PS2) 10Giuseppe Lavagetto: puppet-merge: allow only showing diffs without merging [puppet] - 10https://gerrit.wikimedia.org/r/476008 [15:45:24] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy [15:45:30] RECOVERY - restbase endpoints health on restbase1017 is OK: All endpoints are healthy [15:45:36] PROBLEM - Restbase edge esams on text-lb.esams.wikimedia.org is CRITICAL: /api/rest_v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [15:45:38] RECOVERY - Request latencies on chlorine is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [15:46:20] (03CR) 10Giuseppe Lavagetto: [C: 032] puppet-merge: allow only showing diffs without merging [puppet] - 10https://gerrit.wikimedia.org/r/476008 (owner: 10Giuseppe Lavagetto) [15:46:34] <_joe_> I'm merging a change to puppet-merge itself [15:46:38] 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-Kanban, 10User-Elukey: rack/setup/install cloudvirtan100[1-5].eqiad.wmnet - https://phabricator.wikimedia.org/T207194 (10Cmjohnson) a:05Cmjohnson>03RobH @RobH all have been cabled and switch port updated minus the vlan. Can you please update vlan... 
[15:46:56] <_joe_> the next person that needs to merge a change should contact me if there are issues [15:47:51] RECOVERY - Restbase edge eqiad on text-lb.eqiad.wikimedia.org is OK: All endpoints are healthy [15:50:31] PROBLEM - restbase endpoints health on restbase1007 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [15:51:25] RECOVERY - restbase endpoints health on restbase1007 is OK: All endpoints are healthy [15:51:43] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Scrapes sample page) timed out before a response was received [15:52:37] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy [15:55:07] RECOVERY - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is OK: All endpoints are healthy [15:55:43] PROBLEM - restbase endpoints health on restbase1009 is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [15:56:41] RECOVERY - restbase endpoints health on restbase1009 is OK: All endpoints are healthy [15:59:07] RECOVERY - Restbase edge esams on text-lb.esams.wikimedia.org is OK: All endpoints are healthy [16:00:23] (03PS1) 10WMDE-Fisch: Set FileImporter config help location [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477798 (https://phabricator.wikimedia.org/T199108) [16:00:43] (03CR) 10jerkins-bot: [V: 04-1] Set FileImporter config help location [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477798 (https://phabricator.wikimedia.org/T199108) (owner: 10WMDE-Fisch) [16:02:06] !log akosiaris@deploy1001 scap-helm zotero upgrade production --set resources.replicas=16 [namespace: zotero, clusters: eqiad,codfw] [16:02:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:02:19] !log akosiaris@deploy1001 scap-helm zotero upgrade production --set resources.replicas=16 stable/zotero [namespace: 
zotero, clusters: eqiad,codfw] [16:02:19] !log akosiaris@deploy1001 scap-helm zotero cluster eqiad completed [16:02:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:02:21] !log akosiaris@deploy1001 scap-helm zotero cluster codfw completed [16:02:21] !log akosiaris@deploy1001 scap-helm zotero finished [16:02:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:02:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:02:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:02:35] PROBLEM - Restbase edge esams on text-lb.esams.wikimedia.org is CRITICAL: /api/rest_v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [16:03:37] RECOVERY - Restbase edge esams on text-lb.esams.wikimedia.org is OK: All endpoints are healthy [16:05:39] (03PS1) 10Andrew Bogott: cumin: don't mention contintcloud, it doesn't exist [puppet] - 10https://gerrit.wikimedia.org/r/477799 (https://phabricator.wikimedia.org/T211213) [16:07:32] (03CR) 10Volans: "question inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/477799 (https://phabricator.wikimedia.org/T211213) (owner: 10Andrew Bogott) [16:07:41] (03PS2) 10Andrew Bogott: cumin: don't mention contintcloud, it doesn't exist [puppet] - 10https://gerrit.wikimedia.org/r/477799 (https://phabricator.wikimedia.org/T211213) [16:08:32] !log redeploying zotero on eqiad [16:08:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:09:08] (03PS1) 10Giuseppe Lavagetto: puppet-merge: correctly add the new option everywhere [puppet] - 10https://gerrit.wikimedia.org/r/477801 [16:09:17] (03CR) 10Andrew Bogott: cumin: don't mention contintcloud, it doesn't exist (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/477799 (https://phabricator.wikimedia.org/T211213) (owner: 10Andrew Bogott) [16:09:23] (03CR) 10Andrew Bogott: [C: 032] cumin: 
don't mention contintcloud, it doesn't exist [puppet] - 10https://gerrit.wikimedia.org/r/477799 (https://phabricator.wikimedia.org/T211213) (owner: 10Andrew Bogott) [16:10:38] andrewbogott: it's also mentioned in modules/openstack/templates/ocata/horizon/local_settings.py.erb btw [16:10:50] yeah, only in a comment though I think [16:11:11] no, the comment is in modules/profile/manifests/ci/castor/server.pp [16:12:13] RECOVERY - Logstash rate of ingestion percent change compared to yesterday on icinga1001 is OK: (C)130 ge (W)110 ge 89.28 https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1panelId=2fullscreen [16:12:13] Yeah… I don't know what any of that does, I will ask [16:13:01] (03CR) 10Giuseppe Lavagetto: [C: 032] puppet-merge: correctly add the new option everywhere [puppet] - 10https://gerrit.wikimedia.org/r/477801 (owner: 10Giuseppe Lavagetto) [16:13:11] (03PS2) 10Giuseppe Lavagetto: puppet-merge: correctly add the new option everywhere [puppet] - 10https://gerrit.wikimedia.org/r/477801 [16:13:25] 10Operations, 10Wikimedia-Logstash: Investigate approaches to ingest sensitive log producers - https://phabricator.wikimedia.org/T205855 (10herron) [16:15:34] !log akosiaris@deploy1001 scap-helm zotero upgrade production -f zotero-values-eqiad.yaml stable/zotero [namespace: zotero, clusters: eqiad] [16:15:36] !log akosiaris@deploy1001 scap-helm zotero cluster eqiad completed [16:15:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:15:36] !log akosiaris@deploy1001 scap-helm zotero finished [16:15:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:15:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:17:04] !log akosiaris@deploy1001 scap-helm zotero upgrade production -f zotero-values-eqiad.yaml stable/zotero [namespace: zotero, clusters: eqiad] [16:17:05] !log akosiaris@deploy1001 scap-helm zotero cluster eqiad completed [16:17:05] !log akosiaris@deploy1001 
scap-helm zotero finished [16:17:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:17:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:17:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:17:36] !log akosiaris@deploy1001 scap-helm zotero upgrade production -f zotero-values-eqiad.yaml stable/zotero [namespace: zotero, clusters: eqiad] [16:17:37] !log akosiaris@deploy1001 scap-helm zotero cluster eqiad completed [16:17:37] !log akosiaris@deploy1001 scap-helm zotero finished [16:17:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:17:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:17:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:40] !log akosiaris@deploy1001 scap-helm zotero upgrade production -f zotero-values-codfw.yaml stable/zotero [namespace: zotero, clusters: codfw] [16:18:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:42] !log akosiaris@deploy1001 scap-helm zotero cluster codfw completed [16:18:42] !log akosiaris@deploy1001 scap-helm zotero finished [16:18:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:19:20] (03PS1) 10Hashar: contint: rephrase comment for castor rsync server [puppet] - 10https://gerrit.wikimedia.org/r/477803 [16:20:06] PROBLEM - Disk space on Hadoop worker on an-worker1080 is CRITICAL: NRPE: Command check_disk_space_hadoop_worker not defined [16:20:12] (03PS2) 10Andrew Bogott: contint: rephrase comment for castor rsync server [puppet] - 10https://gerrit.wikimedia.org/r/477803 (owner: 10Hashar) [16:20:56] PROBLEM - puppet last run on an-worker1080 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 8 seconds ago with 1 failures. 
Failed resources (up to 3 shown): Exec[spark2_yarn_shuffle_jar_install] [16:21:07] RECOVERY - Disk space on Hadoop worker on an-worker1080 is OK: DISK OK [16:21:56] (03CR) 10Andrew Bogott: [C: 032] contint: rephrase comment for castor rsync server [puppet] - 10https://gerrit.wikimedia.org/r/477803 (owner: 10Hashar) [16:24:59] 10Operations, 10Wikimedia-Logstash, 10vm-requests: Spin up 3 logstash/kibana frontend VMs in codfw - https://phabricator.wikimedia.org/T211217 (10herron) p:05Triage>03Normal [16:25:07] 10Operations, 10ops-eqiad, 10DBA: rack/setup/install pc1007-pc1010 - https://phabricator.wikimedia.org/T207258 (10Cmjohnson) Waiting on a technician to swap out the motherboard. Our request was approved. [16:26:00] RECOVERY - puppet last run on an-worker1080 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:26:18] 10Operations, 10Citoid, 10Patch-For-Review, 10Services (done), 10VisualEditor (Current work): Transition citoid to use Zotero's translation-server-v2 - https://phabricator.wikimedia.org/T197242 (10akosiaris) FWIW we 've had a number of minor outages and alerts resulting in increased latency for results.... [16:27:25] !log uploaded python-thumbor-wikimedia_2.2-1+deb9u1 to stretch-wikimedia [16:27:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:29:31] 10Operations, 10ops-eqiad, 10DBA: rack/setup/install pc1007-pc1010 - https://phabricator.wikimedia.org/T207258 (10Marostegui) Awesome! Thank you @Cmjohnson! :) If you get it online today, reminder: RAID5 with 256 stripe! (Reminding it because it is not the usual config) Thanks a lot [16:32:20] 10Operations, 10Citoid, 10Patch-For-Review, 10Services (done), 10VisualEditor (Current work): Transition citoid to use Zotero's translation-server-v2 - https://phabricator.wikimedia.org/T197242 (10Sebastian_Berlin-WMSE) I'm not getting the correct response when using the translators I've written. A few f... 
[16:34:07] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "mostly LGTM, one more nitpick though." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/476985 (owner: 10Paladox) [16:34:18] _joe_ thanks! [16:34:53] <_joe_> paladox: just one small correction [16:35:29] yup [16:35:48] _joe_ though i noticed we use a source for this "puppet:///modules/phabricator/apache/mpm_prefork.conf" [16:35:58] which would not work for "worker" i guess [16:36:13] 10Operations, 10Citoid, 10Patch-For-Review, 10Services (done), 10VisualEditor (Current work): Transition citoid to use Zotero's translation-server-v2 - https://phabricator.wikimedia.org/T197242 (10mobrovac) >>! In T197242#4800644, @Sebastian_Berlin-WMSE wrote: > I'm not getting the correct response when... [16:36:26] <_joe_> paladox: ugh sorry, yes [16:36:54] _joe_ do i make source undef if it's a worker? [16:37:14] <_joe_> paladox: not sure, should check the code for that class [16:37:27] ok [16:37:32] seems to support undef [16:37:35] as it's the default [16:38:35] (03PS27) 10Paladox: phabricator: Add support for php-fpm in stretch [puppet] - 10https://gerrit.wikimedia.org/r/476985 [16:38:40] (03CR) 10Paladox: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/476985 (owner: 10Paladox) [16:39:01] (03CR) 10jerkins-bot: [V: 04-1] phabricator: Add support for php-fpm in stretch [puppet] - 10https://gerrit.wikimedia.org/r/476985 (owner: 10Paladox) [16:39:03] (03CR) 10Paladox: phabricator: Add support for php-fpm in stretch (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/476985 (owner: 10Paladox) [16:39:10] (03CR) 10jerkins-bot: [V: 04-1] phabricator: Add support for php-fpm in stretch [puppet] - 10https://gerrit.wikimedia.org/r/476985 (owner: 10Paladox) [16:39:13] <_joe_> paladox: that would give you the standard settings though [16:39:18] <_joe_> probably not what you want [16:39:20] hmm [16:39:28] <_joe_> lemme see what we do for mediawiki [16:39:33] thanks :) [16:40:29] <_joe_> paladox: 
mediawiki/apache/worker.conf.erb [16:40:36] ah /me looks [16:40:37] thanks! [16:41:12] <_joe_> you won't need the IfDefine SLOW [16:42:43] ok [16:43:15] (03PS3) 10CRusnov: Add an old hardware report [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/477716 (https://phabricator.wikimedia.org/T205899) [16:43:35] 10Operations, 10MediaWiki-Debug-Logger, 10Performance-Team: Set up request profiling for PHP 7 - https://phabricator.wikimedia.org/T206152 (10Krinkle) a:03Krinkle [16:44:03] (03PS1) 10Arturo Borrero Gonzalez: Revert "cloudvps: labtestn: introduce new physical net mapping" [puppet] - 10https://gerrit.wikimedia.org/r/477807 (https://phabricator.wikimedia.org/T207663) [16:45:10] (03CR) 10Arturo Borrero Gonzalez: [C: 032] Revert "cloudvps: labtestn: introduce new physical net mapping" [puppet] - 10https://gerrit.wikimedia.org/r/477807 (https://phabricator.wikimedia.org/T207663) (owner: 10Arturo Borrero Gonzalez) [16:46:27] (03CR) 10CRusnov: "Blacken fixed all of your nitpicks :>" (037 comments) [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/477716 (https://phabricator.wikimedia.org/T205899) (owner: 10CRusnov) [16:46:57] ACKNOWLEDGEMENT - IPMI Sensor Status on elastic2051 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [PS Redundancy = Critical] Gehel tracked on https://phabricator.wikimedia.org/T211219 [16:49:17] (03PS28) 10Paladox: phabricator: Add support for php-fpm in stretch [puppet] - 10https://gerrit.wikimedia.org/r/476985 [16:49:20] _joe_ ^^ [16:49:38] 10Operations, 10Traffic, 10netops: IPv6 ~20ms higher ping than IPv4 to gerrit - https://phabricator.wikimedia.org/T211079 (10ayounsi) I'm all for testing T204281, but it's probably wise to wait for January for that. Until then, a temporary fix can be to move HE from the peering group to the transit group. 
[16:49:50] (03PS29) 10Paladox: phabricator: Add support for php-fpm in stretch [puppet] - 10https://gerrit.wikimedia.org/r/476985 [16:50:03] (03CR) 10Paladox: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/476985 (owner: 10Paladox) [16:50:30] (03PS4) 10CRusnov: Add an old hardware report [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/477716 (https://phabricator.wikimedia.org/T205899) [16:51:05] (03CR) 10CRusnov: "- Minor fix from testing." [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/477716 (https://phabricator.wikimedia.org/T205899) (owner: 10CRusnov) [16:52:48] (03CR) 10Giuseppe Lavagetto: [C: 031] "The patch looks overall ok. Someone else should run the compiler to see what would change and be around to troubleshoot any possible fallo" [puppet] - 10https://gerrit.wikimedia.org/r/476985 (owner: 10Paladox) [16:53:03] !log activate ams-ix prefix list entry on cr2-esams [16:53:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:53:06] thank you _joe_! [16:53:31] <_joe_> paladox: thank you for that patch :) [16:53:39] your welcome :) [16:54:12] <_joe_> paladox: although in the future it could make sense to create a more generic "profile::php_fcgi" that we can reuse to setup apache + php-fpm everywhere [16:54:20] <_joe_> with little tweaks here and there [16:54:26] _joe_ ah yeh. 
[16:54:37] I was trying to do phabricator::php to make things look nicer [16:54:41] but it failed with the linter [16:54:55] <_joe_> but better to do things one by one than doing nothing :) [16:54:59] (03CR) 10Gehel: "minor comments inline, otherwise LGTM" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/477761 (https://phabricator.wikimedia.org/T211023) (owner: 10Mathew.onipe) [16:55:01] heh :) [16:55:11] <_joe_> it should really improve phab's performance [16:55:25] <_joe_> but expect to have failures when that patch gets applied [16:55:37] <_joe_> too many things have to change at the same time :) [16:56:04] _joe_ at least i've confirmed it works on stretch [16:56:07] <_joe_> mutante: I would suggest not applying that patch on a system that's serving production traffic tbh [16:56:08] in the cloud (labs) [16:56:19] <_joe_> paladox: yeah, the issue is just transitioning the apache config [16:56:23] <_joe_> and the php config [16:56:26] <_joe_> at the same time [16:56:45] <_joe_> it will most probably fail and need some manual repair [16:56:53] _joe_ the php change should only affect stretch (i've tryed to make sure i touched little php5) [16:57:22] <_joe_> paladox: sure, but I'm not sure if we run phab from stretch in production or not [16:57:31] _joe_ we doin't (yet) [16:57:34] <_joe_> ok [16:57:35] phab1001 is jessie [16:57:38] phab1002 is stretch [16:57:47] <_joe_> then it's safer :) [16:57:56] :) [16:58:16] !log re-deactivate ams-ix prefix list entry on cr2-esams [16:58:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:58:25] (03CR) 10Volans: Add an old hardware report (031 comment) [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/477716 (https://phabricator.wikimedia.org/T205899) (owner: 10CRusnov) [16:58:29] 10Operations, 10Mail, 10WMF-Legal: Tracking down gary@ and redirecting it to trustandsafety@ - https://phabricator.wikimedia.org/T210464 (10bcampbell) @jijiki I'll email Legal today and follow up 
with an answer on this thread. Thanks for your help. [16:59:10] _joe_: yea, it's jessie right now but i want to finally do the stretch switch to 1002 soon [16:59:14] (03PS1) 10Herron: add forward and reverse DNS for codfw logstash VMs [dns] - 10https://gerrit.wikimedia.org/r/477810 (https://phabricator.wikimedia.org/T211217) [16:59:20] but it means merging that won't affect prod yet.. ack [16:59:21] (03PS1) 10Andrew Bogott: Openstack: support multiple regions [software/cumin] - 10https://gerrit.wikimedia.org/r/477811 [16:59:36] <_joe_> mutante: test everything through the compiler though [16:59:40] <_joe_> it's a complex change [16:59:43] ok [17:00:05] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: How many deployers does it take to do Morning SWAT (Max 6 patches) deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181205T1700). [17:00:05] No GERRIT patches in the queue for this window AFAICS. [17:00:05] 10Operations, 10netops, 10Patch-For-Review, 10cloud-services-team (Kanban): Renumber cloud-instance-transport1-b-eqiad to public IPs - https://phabricator.wikimedia.org/T207663 (10aborrero) Trying now with only adding a new subnet object: `lang=shell-session root@labtestcontrol2003:~# neutron subnet-creat... [17:00:16] (03PS2) 10Andrew Bogott: Openstack: support multiple regions [software/cumin] - 10https://gerrit.wikimedia.org/r/477811 (https://phabricator.wikimedia.org/T208861) [17:00:27] (03PS8) 10Paladox: Configuration for phabricator to use swift storage. 
[puppet] - 10https://gerrit.wikimedia.org/r/432528 (https://phabricator.wikimedia.org/T182085) (owner: 1020after4) [17:00:58] _joe_ i ran it through the compiler :) [17:01:08] https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler-test/56/console [17:01:45] (03PS3) 10Andrew Bogott: Openstack: support multiple regions [software/cumin] - 10https://gerrit.wikimedia.org/r/477811 (https://phabricator.wikimedia.org/T208861) [17:03:06] (03CR) 10Gehel: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/477748 (https://phabricator.wikimedia.org/T183454) (owner: 10Muehlenhoff) [17:04:45] (03CR) 10jerkins-bot: [V: 04-1] Openstack: support multiple regions [software/cumin] - 10https://gerrit.wikimedia.org/r/477811 (https://phabricator.wikimedia.org/T208861) (owner: 10Andrew Bogott) [17:06:32] (03PS2) 10Kosta Harlan: Enable Help Panel on beta labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477784 (https://phabricator.wikimedia.org/T211206) [17:07:54] (03CR) 10Kosta Harlan: Enable Help Panel on beta labs (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477784 (https://phabricator.wikimedia.org/T211206) (owner: 10Kosta Harlan) [17:08:23] (03CR) 10Sbisson: [C: 032] Enable Help Panel on beta labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477784 (https://phabricator.wikimedia.org/T211206) (owner: 10Kosta Harlan) [17:09:29] (03Merged) 10jenkins-bot: Enable Help Panel on beta labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477784 (https://phabricator.wikimedia.org/T211206) (owner: 10Kosta Harlan) [17:11:27] !log add public IPs to codfw cloud-instance-transport1-b T207663 [17:11:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:11:30] T207663: Renumber cloud-instance-transport1-b-eqiad to public IPs - https://phabricator.wikimedia.org/T207663 [17:11:34] (03CR) 10Mathew.onipe: [C: 031] Remove Diamond on maps servers [puppet] - 10https://gerrit.wikimedia.org/r/477748 
(https://phabricator.wikimedia.org/T183454) (owner: 10Muehlenhoff) [17:12:48] (03CR) 10jenkins-bot: Enable Help Panel on beta labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477784 (https://phabricator.wikimedia.org/T211206) (owner: 10Kosta Harlan) [17:14:17] (03PS1) 10Paladox: phabricator::vcs: Allow String or Array as puppet type for listen_address [puppet] - 10https://gerrit.wikimedia.org/r/477814 [17:15:23] 10Operations, 10Gerrit: Convert Gerrit to use H2 as the database after 2.16 upgrade - https://phabricator.wikimedia.org/T211139 (10Dzahn) p:05Triage>03Normal [17:16:09] (03PS2) 10Paladox: phabricator::vcs: Allow String or Array as puppet type for listen_address [puppet] - 10https://gerrit.wikimedia.org/r/477814 [17:18:07] (03PS1) 10Hoo man: Enable Kartographer maps on testwikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477817 (https://phabricator.wikimedia.org/T184933) [17:18:31] (03CR) 10Hoo man: [C: 032] Enable Kartographer maps on testwikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477817 (https://phabricator.wikimedia.org/T184933) (owner: 10Hoo man) [17:19:34] (03Merged) 10jenkins-bot: Enable Kartographer maps on testwikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477817 (https://phabricator.wikimedia.org/T184933) (owner: 10Hoo man) [17:19:53] !log remove private IPs from codfw cloud-instance-transport1-b T207663 [17:19:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:19:57] T207663: Renumber cloud-instance-transport1-b-eqiad to public IPs - https://phabricator.wikimedia.org/T207663 [17:20:13] (03PS1) 10Fdans: Revert "Add change_tag to list of tables to sqoop" [puppet] - 10https://gerrit.wikimedia.org/r/477818 [17:20:45] 10Operations: puppet (systemd::service) attempts to start masked units - https://phabricator.wikimedia.org/T211027 (10Dzahn) https://tickets.puppetlabs.com/browse/PUP-1253 [17:20:57] !log hoo@deploy1001 Synchronized 
wmf-config/InitialiseSettings-labs.php: (no justification provided) (duration: 00m 46s) [17:20:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:21:03] kostajh: Hey :) If you merge changes to mediawiki-config (even -labs only ones), please pull them to deploy1001 and sync them [17:22:01] !log hoo@deploy1001 Synchronized wmf-config/InitialiseSettings-labs.php: (no justification provided) (duration: 00m 46s) [17:22:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:22:20] hoo: ok, didn't realize that was the practice. Looks like you just did it? ^ [17:22:26] Operations, Core Platform Team Backlog (Watching / External), HHVM, Patch-For-Review, and 2 others: Migrate to PHP 7 in WMF production - https://phabricator.wikimedia.org/T176370 (Krinkle) [17:22:40] kostajh: Yeah, I did it while I was on it [17:22:48] Just minimizes confusion [17:22:59] hoo: thanks! Is there a documentation page that explains the process, so I can review it? 
[17:23:08] probably [17:23:10] !log hoo@deploy1001 Synchronized wmf-config/Wikibase.php: Enable Kartographer maps on testwikidatawiki (T184933) (duration: 00m 46s) [17:23:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:23:14] T184933: Display map for geocoordinate statements - https://phabricator.wikimedia.org/T184933 [17:24:43] kostajh: https://wikitech.wikimedia.org/wiki/Heterogeneous_deployment#Change_wiki_configuration [17:24:51] But that's not as explicit as it could be [17:25:05] hoo: thank you [17:25:33] (CR) jenkins-bot: Enable Kartographer maps on testwikidatawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/477817 (https://phabricator.wikimedia.org/T184933) (owner: Hoo man) [17:26:47] (PS1) Ayounsi: Remove private IPs for labs-instance-transport1-b-codfw [dns] - https://gerrit.wikimedia.org/r/477819 (https://phabricator.wikimedia.org/T207663) [17:27:07] (CR) Paladox: "https://puppet-compiler.wmflabs.org/compiler1002/56/ looks good to me." [puppet] - https://gerrit.wikimedia.org/r/476985 (owner: Paladox) [17:28:24] Operations, Mail, WMF-Legal: Tracking down gary@ and redirecting it to trustandsafety@ - https://phabricator.wikimedia.org/T210464 (RStallman-legalteam) Sorry I haven't responded! Not sure how to handle this one, but I will ask among the legal team and get back to the thread. [17:28:32] PROBLEM - Host kafka1013 is DOWN: PING CRITICAL - Packet loss = 100% [17:28:43] (CR) Dzahn: "why not set it to an empty array instead ?" [puppet] - https://gerrit.wikimedia.org/r/477814 (owner: Paladox) [17:29:15] cmjohnson1: are you working on kafka1013 by any chance? 
[17:29:41] checking console anyway [17:30:12] elukey I am in that rack [17:30:18] I may have touched something [17:30:48] RECOVERY - Host kafka1013 is UP: PING OK - Packet loss = 0%, RTA = 0.20 ms [17:30:51] (Abandoned) Paladox: phabricator::vcs: Allow String or Array as puppet type for listen_address [puppet] - https://gerrit.wikimedia.org/r/477814 (owner: Paladox) [17:31:23] elukey: most likely it was loose power cables since that is what I was doing [17:32:24] cmjohnson1: ah ok, perfect :) [17:40:48] (PS30) Paladox: phabricator: Add support for php-fpm in stretch [puppet] - https://gerrit.wikimedia.org/r/476985 [17:40:53] (CR) Paladox: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/476985 (owner: Paladox) [17:41:01] Operations, ops-codfw, netops, Patch-For-Review, User-jijiki: codfw row D recable and add QFX - https://phabricator.wikimedia.org/T210467 (jijiki) [17:43:01] Operations, Scoring-platform-team, Release-Engineering-Team (Watching / External): Contact number of some WMDE staff should be avalible to SRE/RelEng - https://phabricator.wikimedia.org/T210721 (greg) Let me know if there's anything I can do, for now I'll just watch and respond as needed :) [17:51:51] (CR) Arturo Borrero Gonzalez: "A records?" 
[dns] - 10https://gerrit.wikimedia.org/r/477819 (https://phabricator.wikimedia.org/T207663) (owner: 10Ayounsi) [17:52:18] RECOVERY - cassandra-c CQL 10.192.48.123:9042 on restbase2017 is OK: TCP OK - 0.036 second response time on 10.192.48.123 port 9042 [17:52:31] (03CR) 10Ayounsi: [C: 032] Remove private IPs for labs-instance-transport1-b-codfw [dns] - 10https://gerrit.wikimedia.org/r/477819 (https://phabricator.wikimedia.org/T207663) (owner: 10Ayounsi) [17:54:33] (03CR) 10Ayounsi: "> Patch Set 1:" [dns] - 10https://gerrit.wikimedia.org/r/477819 (https://phabricator.wikimedia.org/T207663) (owner: 10Ayounsi) [17:55:32] 10Operations, 10netops, 10Patch-For-Review, 10cloud-services-team (Kanban): Renumber cloud-instance-transport1-b-eqiad to public IPs - https://phabricator.wikimedia.org/T207663 (10aborrero) @ayounsi and I did in real time both: 1) change routing in CRs 2) introduce the new default gateway for neutron: `l... [17:57:24] PROBLEM - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is CRITICAL: /en.wikipedia.org/v1/data/citation/{format}/{query} (Get citation for Darth Vader) timed out before a response was received [17:58:28] RECOVERY - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is OK: All endpoints are healthy [18:01:20] !log uploaded thumbor_6.3.2+git20170607-1+deb9u1 to stretch-wikimedia [18:01:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:04:42] (03PS31) 10Paladox: phabricator: Add support for php-fpm in stretch [puppet] - 10https://gerrit.wikimedia.org/r/476985 [18:06:06] (03PS32) 10Paladox: phabricator: Add support for php-fpm in stretch [puppet] - 10https://gerrit.wikimedia.org/r/476985 [18:06:12] (03CR) 10Paladox: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/476985 (owner: 10Paladox) [18:18:55] (03PS4) 10Dzahn: phragile: add stretch/PHP7 support [puppet] - 10https://gerrit.wikimedia.org/r/475032 (https://phabricator.wikimedia.org/T211228) [18:20:41] 10Operations, 10ops-codfw: ms-be2047 
spontaneous reboots - https://phabricator.wikimedia.org/T209921 (10Papaul) @fgiunchedi please see below. Papual, while you are trying a different power source, my Linux software support would like to review the OS logs to make sure we have covered all possible causes. They... [18:23:10] 10Operations, 10Analytics, 10Analytics-Kanban, 10User-Elukey: rack/setup/install cloudvirtan100[1-5].eqiad.wmnet - https://phabricator.wikimedia.org/T207194 (10RobH) a:05RobH>03Ottomata >>! In T207194#4800497, @Cmjohnson wrote: > @RobH > > all have been cabled and switch port updated minus the vlan.... [18:33:27] (03PS5) 10Dzahn: phragile: add stretch/PHP7 support [puppet] - 10https://gerrit.wikimedia.org/r/475032 (https://phabricator.wikimedia.org/T211228) [18:34:13] (03PS6) 10Daimona Eaytoy: Update AbuseFilter config to keep the status quo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475772 [18:34:20] (03PS4) 10Daimona Eaytoy: Move all AbuseFilter config to abusefilter.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477063 (https://phabricator.wikimedia.org/T145931) [18:34:26] (03CR) 10Dzahn: [C: 032] "per IRC chat just now.. 
the instance using this is on jessie, so this changes nothing, just makes it possible to apply the same role on a " [puppet] - 10https://gerrit.wikimedia.org/r/475032 (https://phabricator.wikimedia.org/T211228) (owner: 10Dzahn) [18:34:35] (03CR) 10jerkins-bot: [V: 04-1] Update AbuseFilter config to keep the status quo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475772 (owner: 10Daimona Eaytoy) [18:35:17] (03PS1) 10Kosta Harlan: Fix formatting of help panel links [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477826 (https://phabricator.wikimedia.org/T211206) [18:36:44] 10Operations, 10Analytics, 10Analytics-Kanban, 10User-Elukey: rack/setup/install cloudvirtan100[1-5].eqiad.wmnet - https://phabricator.wikimedia.org/T207194 (10RobH) [18:41:51] (03CR) 10Dzahn: "@Thifranc we talked about this on IRC, can you just add the stderr "2>" part back that you had in PS3? then we can go ahead with this. and" [puppet] - 10https://gerrit.wikimedia.org/r/470877 (https://phabricator.wikimedia.org/T150375) (owner: 10Thifranc) [18:46:11] (03PS1) 10Cmjohnson: Adding mgmt dns for ms-be1044-50 [dns] - 10https://gerrit.wikimedia.org/r/477829 (https://phabricator.wikimedia.org/T209618) [18:48:39] (03CR) 10Sbisson: [C: 032] "(I will sync this time)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477826 (https://phabricator.wikimedia.org/T211206) (owner: 10Kosta Harlan) [18:49:05] (03PS1) 10Andrew Bogott: Make cloudvirtan100x boxes eqiad1 labvirts [puppet] - 10https://gerrit.wikimedia.org/r/477830 (https://phabricator.wikimedia.org/T207194) [18:51:22] (03Merged) 10jenkins-bot: Fix formatting of help panel links [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477826 (https://phabricator.wikimedia.org/T211206) (owner: 10Kosta Harlan) [18:52:58] 10Operations, 10ops-codfw, 10Core Platform Team, 10Services (doing), and 2 others: Reshape RESTBase Cassandra cluster for server refresh - https://phabricator.wikimedia.org/T210843 (10Eevans) [18:53:41] (03PS2) 10Andrew 
Bogott: Make cloudvirtan100x boxes eqiad1 labvirts [puppet] - 10https://gerrit.wikimedia.org/r/477830 (https://phabricator.wikimedia.org/T207194) [18:53:47] !log sbisson@deploy1001 Synchronized wmf-config/InitialiseSettings-labs.php: Fix formatting of help panel links (duration: 00m 47s) [18:53:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:54:24] (03CR) 10Andrew Bogott: [C: 032] Make cloudvirtan100x boxes eqiad1 labvirts [puppet] - 10https://gerrit.wikimedia.org/r/477830 (https://phabricator.wikimedia.org/T207194) (owner: 10Andrew Bogott) [18:55:18] RECOVERY - cassandra-a SSL 10.192.48.124:7001 on restbase2018 is OK: SSL OK - Certificate restbase2018-a valid until 2020-11-29 09:26:20 +0000 (expires in 724 days) [18:55:19] (03CR) 10jenkins-bot: Fix formatting of help panel links [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477826 (https://phabricator.wikimedia.org/T211206) (owner: 10Kosta Harlan) [18:55:52] RECOVERY - cassandra-a service on restbase2018 is OK: OK - cassandra-a is active [18:57:05] (03PS2) 10Herron: add forward and reverse DNS for codfw logstash VMs [dns] - 10https://gerrit.wikimedia.org/r/477810 (https://phabricator.wikimedia.org/T211217) [18:57:28] (03CR) 10jerkins-bot: [V: 04-1] add forward and reverse DNS for codfw logstash VMs [dns] - 10https://gerrit.wikimedia.org/r/477810 (https://phabricator.wikimedia.org/T211217) (owner: 10Herron) [18:58:48] (03CR) 10Herron: "recheck" [dns] - 10https://gerrit.wikimedia.org/r/477810 (https://phabricator.wikimedia.org/T211217) (owner: 10Herron) [18:59:13] fluke I guess... 
[18:59:28] (03CR) 10Herron: [C: 032] add forward and reverse DNS for codfw logstash VMs [dns] - 10https://gerrit.wikimedia.org/r/477810 (https://phabricator.wikimedia.org/T211217) (owner: 10Herron) [19:02:45] (03PS1) 10Bstorm: sonofgridengine: remove the rc links for the old sysV init as well [puppet] - 10https://gerrit.wikimedia.org/r/477833 (https://phabricator.wikimedia.org/T211055) [19:02:47] (03PS1) 10Herron: Revert "add forward and reverse DNS for codfw logstash VMs" [dns] - 10https://gerrit.wikimedia.org/r/477834 [19:03:32] (03CR) 10Herron: [C: 032] Revert "add forward and reverse DNS for codfw logstash VMs" [dns] - 10https://gerrit.wikimedia.org/r/477834 (owner: 10Herron) [19:03:47] (03CR) 10Bstorm: [C: 032] sonofgridengine: remove the rc links for the old sysV init as well [puppet] - 10https://gerrit.wikimedia.org/r/477833 (https://phabricator.wikimedia.org/T211055) (owner: 10Bstorm) [19:03:51] (03PS1) 10Herron: add forward and reverse DNS for codfw logstash VMs"" [dns] - 10https://gerrit.wikimedia.org/r/477835 (https://phabricator.wikimedia.org/T211217) [19:03:58] (03PS2) 10Bstorm: sonofgridengine: remove the rc links for the old sysV init as well [puppet] - 10https://gerrit.wikimedia.org/r/477833 (https://phabricator.wikimedia.org/T211055) [19:04:23] (03CR) 10Phuedx: EventLogging Logstash filter: move useful fields out of event (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/477419 (https://phabricator.wikimedia.org/T205437) (owner: 10Gergő Tisza) [19:06:07] (03PS2) 10Herron: add forward and reverse DNS for codfw logstash VMs [dns] - 10https://gerrit.wikimedia.org/r/477835 (https://phabricator.wikimedia.org/T211217) [19:07:40] (03CR) 10Ottomata: EventLogging Logstash filter: move useful fields out of event (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/477419 (https://phabricator.wikimedia.org/T205437) (owner: 10Gergő Tisza) [19:08:30] (03CR) 10Herron: [C: 032] add forward and reverse DNS for codfw logstash VMs [dns] - 
10https://gerrit.wikimedia.org/r/477835 (https://phabricator.wikimedia.org/T211217) (owner: 10Herron) [19:10:47] 10Operations, 10DC-Ops, 10media-storage: ms-be raid setup / evaluation (currently using swraid on top of hwraid) - https://phabricator.wikimedia.org/T211231 (10RobH) p:05Triage>03Normal [19:12:27] !log cloudvirt1019 for an all inclusive part swap by HPE [19:12:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:13:04] andrewbogott bstorm_ ^^ [19:13:19] PROBLEM - Host cloudvirt1019 is DOWN: PING CRITICAL - Packet loss = 100% [19:13:22] 'all inclusive part swap'? [19:13:53] I guess we can ignore the page? [19:14:05] ok [19:14:12] I was worried for a sec [19:14:34] ACKNOWLEDGEMENT - Host cloudvirt1019 is DOWN: PING CRITICAL - Packet loss = 100% andrew bogott Chris is working on this [19:14:34] andrewbogott: new board, battery, cables, backplane [19:14:39] and there's the ack [19:14:41] thanks [19:14:44] wow, ok [19:15:08] if that doesn't fix it then I am out of answers [19:19:21] wow [19:19:28] Yeah, seriously [19:19:52] * bd808 gets ready to kick cloudvirt1019 into the creek [19:20:15] (03CR) 10Herron: "Adding Andrew and Luca for awareness -- Please see comment inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/476982 (https://phabricator.wikimedia.org/T63788) (owner: 10Herron) [19:20:25] we have had a lot of bad hardware luck in the cloud* space :( [19:21:54] (03PS1) 10Andrew Bogott: cloudvirtan partman: use xfs and a more convenient mount point [puppet] - 10https://gerrit.wikimedia.org/r/477839 [19:22:58] oo ok andrewbogott sorry didn't realize that stuff [19:23:56] (03CR) 10Gergő Tisza: EventLogging Logstash filter: move useful fields out of event (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/477419 (https://phabricator.wikimedia.org/T205437) (owner: 10Gergő Tisza) [19:24:12] (03CR) 10Ottomata: [C: 031] "Eyebrow height remains steady. 
:p," [puppet] - 10https://gerrit.wikimedia.org/r/476982 (https://phabricator.wikimedia.org/T63788) (owner: 10Herron) [19:25:03] (03CR) 10Andrew Bogott: [C: 032] cloudvirtan partman: use xfs and a more convenient mount point [puppet] - 10https://gerrit.wikimedia.org/r/477839 (owner: 10Andrew Bogott) [19:26:09] (03CR) 10Ottomata: [C: 031] "Hm, thought, Kafka logs are configured via log4j. I think most of the logs also go to stdout, which are then shipped to syslog via journ" [puppet] - 10https://gerrit.wikimedia.org/r/476982 (https://phabricator.wikimedia.org/T63788) (owner: 10Herron) [19:29:23] (03CR) 10Ottomata: "It isn't always the url encoded json object, sometimes is is the just the json encoded string." [puppet] - 10https://gerrit.wikimedia.org/r/477419 (https://phabricator.wikimedia.org/T205437) (owner: 10Gergő Tisza) [19:30:37] (03PS3) 10Bstorm: sonofgridengine: remove the rc links for the old sysV init as well [puppet] - 10https://gerrit.wikimedia.org/r/477833 (https://phabricator.wikimedia.org/T211055) [19:37:03] (03CR) 10EBernhardson: "I took a closer look over the related configuration involved. The spawner offers a mode `isolate_devices` that will enable systemd `Privat" [puppet] - 10https://gerrit.wikimedia.org/r/477711 (https://phabricator.wikimedia.org/T211163) (owner: 10EBernhardson) [19:49:55] (03CR) 10Herron: "> Hm, thought, Kafka logs are configured via log4j. I think most of" [puppet] - 10https://gerrit.wikimedia.org/r/476982 (https://phabricator.wikimedia.org/T63788) (owner: 10Herron) [19:50:44] (03CR) 10Ottomata: [C: 032] "Erik I love that you can submit patches like this, finding the code in puppet. Thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/477711 (https://phabricator.wikimedia.org/T211163) (owner: 10EBernhardson) [19:50:51] (03PS2) 10Ottomata: Allow jupyterhub notebooks /dev/shm access [puppet] - 10https://gerrit.wikimedia.org/r/477711 (https://phabricator.wikimedia.org/T211163) (owner: 10EBernhardson) [19:50:55] (03CR) 10Ottomata: [V: 032 C: 032] Allow jupyterhub notebooks /dev/shm access [puppet] - 10https://gerrit.wikimedia.org/r/477711 (https://phabricator.wikimedia.org/T211163) (owner: 10EBernhardson) [19:53:10] (03CR) 10Dzahn: phabricator: Add support for php-fpm in stretch (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/476985 (owner: 10Paladox) [19:53:13] ebernhardson: i believe now if you restart your notebook server you'll pick up that change [19:54:07] (03CR) 10Paladox: phabricator: Add support for php-fpm in stretch (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/476985 (owner: 10Paladox) [19:54:26] (03CR) 10Ottomata: [C: 031] "Hm, yeah. Hm. I guess I meant log4j -> kafka then. But I'm fine with this this is great!" 
[puppet] - 10https://gerrit.wikimedia.org/r/476982 (https://phabricator.wikimedia.org/T63788) (owner: 10Herron) [19:58:29] 10Operations, 10ops-eqiad: Degraded RAID on cloudvirtan1001 - https://phabricator.wikimedia.org/T211235 (10ops-monitoring-bot) [20:02:00] PROBLEM - MariaDB Slave Lag: s8 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1017.02 seconds [20:07:41] (03PS1) 10CRusnov: Add reports deployment to netbox profile [puppet] - 10https://gerrit.wikimedia.org/r/477845 [20:08:07] !log repooling labsdb1010 - T210693 [20:08:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:08:11] T210693: Create materialized views on Wiki Replica hosts for better query performance - https://phabricator.wikimedia.org/T210693 [20:08:33] (03CR) 10jerkins-bot: [V: 04-1] Add reports deployment to netbox profile [puppet] - 10https://gerrit.wikimedia.org/r/477845 (owner: 10CRusnov) [20:12:02] (03PS1) 10Banyek: Revert "wiki replicas: depool labsdb1010 for testing materialized view" [puppet] - 10https://gerrit.wikimedia.org/r/477846 [20:13:28] (03CR) 10Banyek: [C: 032] Revert "wiki replicas: depool labsdb1010 for testing materialized view" [puppet] - 10https://gerrit.wikimedia.org/r/477846 (owner: 10Banyek) [20:13:39] (03PS2) 10Banyek: Revert "wiki replicas: depool labsdb1010 for testing materialized view" [puppet] - 10https://gerrit.wikimedia.org/r/477846 [20:14:12] (03PS33) 10Dzahn: phabricator: Add support for php-fpm in stretch [puppet] - 10https://gerrit.wikimedia.org/r/476985 (owner: 10Paladox) [20:15:38] (03CR) 10Gergő Tisza: "> It isn't always the url encoded json object, sometimes is is the just the json encoded string." 
[puppet] - 10https://gerrit.wikimedia.org/r/477419 (https://phabricator.wikimedia.org/T205437) (owner: 10Gergő Tisza) [20:19:12] (03CR) 10Dzahn: [C: 032] "compiler looks good https://puppet-compiler.wmflabs.org/compiler1002/13846/ disabling puppet on phab1001 (prod, jessie), deploying on pha" [puppet] - 10https://gerrit.wikimedia.org/r/476985 (owner: 10Paladox) [20:19:28] (03PS34) 10Dzahn: phabricator: Add support for php-fpm in stretch [puppet] - 10https://gerrit.wikimedia.org/r/476985 (owner: 10Paladox) [20:22:08] PROBLEM - Request latencies on chlorine is CRITICAL: instance=10.64.0.45:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [20:23:22] RECOVERY - Request latencies on chlorine is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [20:25:30] (03CR) 10Dzahn: [C: 032] "on stretch, phab1002:" [puppet] - 10https://gerrit.wikimedia.org/r/476985 (owner: 10Paladox) [20:27:35] RECOVERY - Host cloudvirt1019 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [20:28:00] lol [20:28:15] at least if it's gonna be a page that's a good one to get [20:28:36] <_joe_> is someone working on that server? [20:28:40] PROBLEM - puppet last run on phab2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:28:45] <_joe_> are you aware we all get paged, right? [20:28:57] (03PS1) 10Paladox: phabricator: Add back debian version check to init.pp [puppet] - 10https://gerrit.wikimedia.org/r/477848 [20:29:29] (03PS2) 10Paladox: phabricator: Add back debian version check to init.pp [puppet] - 10https://gerrit.wikimedia.org/r/477848 [20:32:46] ACKNOWLEDGEMENT - puppet last run on phab2001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues daniel_zahn adding fpm support [20:33:42] (03CR) 10Dzahn: [C: 032] phabricator: Add back debian version check to init.pp [puppet] - 10https://gerrit.wikimedia.org/r/477848 (owner: 10Paladox) [20:35:18] 10Operations, 10Wikimedia-Logstash, 10service-runner, 10Core Platform Team Backlog (Next), 10Services (next): Move service-runner to new logging infrastructure - https://phabricator.wikimedia.org/T211125 (10Pchelolo) So, currently we only support sending to syslog via UDP using the [[ https://github.com/... [20:36:24] (03CR) 10Dzahn: [C: 032] "follow-up https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/477848/ fixed the puppet run on jessie (phab2001)" [puppet] - 10https://gerrit.wikimedia.org/r/476985 (owner: 10Paladox) [20:36:49] yeah I peeked in earlier and mentioned [20:37:06] 10Operations, 10monitoring, 10User-CDanis: graph server temperature metrics - https://phabricator.wikimedia.org/T209863 (10CDanis) Haven't had too much time to look at this. FWIW I do see temperature sensors exported by the kernel on cp3007: ` cdanis@cp3007 ~ % paste <(ls -d1 /sys/class/thermal/thermal_zo... [20:39:08] RECOVERY - puppet last run on phab2001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [20:42:48] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is CRITICAL: cluster=cache_text site=eqsin https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4fullscreenrefresh=1morgId=1 [20:44:40] RECOVERY - MariaDB Slave Lag: s8 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 267.17 seconds [20:45:16] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4fullscreenrefresh=1morgId=1 [20:46:20] PROBLEM - Eqsin HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3fullscreenorgId=1var-site=eqsinvar-cache_type=Allvar-status_type=5 [20:46:22] PROBLEM - Device not healthy -SMART- on stat1004 is CRITICAL: cluster=analytics device=sde instance=stat1004:9100 job=node site=eqiad https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=stat1004var-datasource=eqiad%2520prometheus%252Fops [20:51:12] RECOVERY - Eqsin HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3fullscreenorgId=1var-site=eqsinvar-cache_type=Allvar-status_type=5 [20:53:44] (03PS1) 10Urbanecm: Fix bad namespace number for yuewiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477856 (https://phabricator.wikimedia.org/T205546) [20:58:27] (03PS1) 10CDanis: use FQDN in my zsh prompt [puppet] - 10https://gerrit.wikimedia.org/r/477857 [21:00:04] cscott, arlolra, subbu, bearND, halfak, and Amir1: That opportune time is upon us again. Time for a Services – Parsoid / Citoid / Mobileapps / ORES / … deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181205T2100). [21:00:30] (03CR) 10CDanis: [C: 032] use FQDN in my zsh prompt [puppet] - 10https://gerrit.wikimedia.org/r/477857 (owner: 10CDanis) [21:04:10] PROBLEM - puppet last run on pc1009 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): File[/home/cdanis/.zshfunc/prompt_cdanis1_setup] [21:04:38] (03PS1) 10Herron: install_server: add codfw logstash hosts and assign spare role [puppet] - 10https://gerrit.wikimedia.org/r/477859 (https://phabricator.wikimedia.org/T211217) [21:06:06] (03PS2) 10Herron: install_server: add codfw logstash hosts and assign spare role [puppet] - 10https://gerrit.wikimedia.org/r/477859 (https://phabricator.wikimedia.org/T211217) [21:08:00] (03CR) 10Herron: [C: 032] install_server: add codfw logstash hosts and assign spare role [puppet] - 10https://gerrit.wikimedia.org/r/477859 (https://phabricator.wikimedia.org/T211217) (owner: 10Herron) [21:12:12] !log arlolra@deploy1001 Started deploy [parsoid/deploy@5e9a496]: Updating Parsoid to a6058e3 [21:12:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:12:57] 10Operations, 10Mail, 10WMF-Legal: Tracking down gary@ and redirecting it to trustandsafety@ - https://phabricator.wikimedia.org/T210464 (10Jalexander) >>! In T210464#4776173, @Dzahn wrote: > Let's clean it up all at once and also do something with pat@ what about box6699@ in general. and what about the OTR... 
[21:20:42] !log mholloway-shell@deploy1001 Started deploy [mobileapps/deploy@243a503]: Update mobileapps to 2f44362 [21:20:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:23:09] RECOVERY - cassandra-a CQL 10.192.48.124:9042 on restbase2018 is OK: TCP OK - 0.036 second response time on 10.192.48.124 port 9042 [21:23:48] !log arlolra@deploy1001 Finished deploy [parsoid/deploy@5e9a496]: Updating Parsoid to a6058e3 (duration: 11m 36s) [21:23:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:25:17] 10Operations, 10Performance-Team, 10Traffic, 10Wikidata, 10Wikidata-Query-Service: Errors trying to fetch RDF from Wikidata - https://phabricator.wikimedia.org/T207718 (10Smalyshev) 05Open>03Resolved Looks like with the change above errors are no longer happening. As I see no noticeable change on the... [21:27:16] (03PS2) 10CRusnov: Add reports deployment to netbox profile [puppet] - 10https://gerrit.wikimedia.org/r/477845 [21:28:03] thcipriani: ping? [21:28:10] (03CR) 10jerkins-bot: [V: 04-1] Add reports deployment to netbox profile [puppet] - 10https://gerrit.wikimedia.org/r/477845 (owner: 10CRusnov) [21:28:42] (03PS1) 10Papaul: DNS: Add production and mgmt DNS entries for logstash200[1-3] [dns] - 10https://gerrit.wikimedia.org/r/477868 (https://phabricator.wikimedia.org/T211065) [21:31:00] 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install codfw logstash elasticsearch storage servers - https://phabricator.wikimedia.org/T211065 (10Papaul) [21:33:40] 10Operations, 10Analytics, 10Analytics-Kanban, 10netops: Figure out networking details for new cloud-analytics-eqiad Hadoop/Presto cluster - https://phabricator.wikimedia.org/T207321 (10Andrew) I've created a proof-of-concept VM, hadoop-worker-01.cloud-analytics.eqiad.wmflabs. Please check that out and co... 
[21:34:55] RECOVERY - puppet last run on pc1009 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [21:36:01] 10Operations, 10Research, 10Patch-For-Review, 10User-Banyek: Import recommendations into production database - https://phabricator.wikimedia.org/T208622 (10Ottomata) @banyek, I was about to help @bmansurov do a manual import of his data into the recommendationapi database on m2-master, while the production... [21:36:50] 10Operations, 10Analytics, 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: rack/setup/install cloudvirtan100[1-5].eqiad.wmnet - https://phabricator.wikimedia.org/T207194 (10Andrew) 05Open>03Resolved [21:37:17] RECOVERY - cassandra-b SSL 10.192.48.125:7001 on restbase2018 is OK: SSL OK - Certificate restbase2018-b valid until 2020-11-29 09:26:21 +0000 (expires in 724 days) [21:37:37] RECOVERY - cassandra-b service on restbase2018 is OK: OK - cassandra-b is active [21:37:47] !log bootstrapping cassandra-b, restbase2018 -- T210843 [21:37:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:37:52] T210843: Reshape RESTBase Cassandra cluster for server refresh - https://phabricator.wikimedia.org/T210843 [21:37:54] 10Operations, 10Research, 10Patch-For-Review, 10User-Banyek: Import recommendations into production database - https://phabricator.wikimedia.org/T208622 (10bmansurov) [21:39:08] !log Updated Parsoid to a6058e3 (T210647, T208360, T205333) [21:39:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:39:14] T208360: Split Utils and DOMUtils into smaller chunks based on functionality - https://phabricator.wikimedia.org/T208360 [21:39:15] T210647: Paragraph wrapper introduces
in output HTML -- investigate and kill them where they are a result of edge case diffs between PHP parser and Parsoid code - https://phabricator.wikimedia.org/T210647 [21:39:15] T205333: Eliminate circular dependencies - https://phabricator.wikimedia.org/T205333 [21:39:28] !log mholloway-shell@deploy1001 Finished deploy [mobileapps/deploy@243a503]: Update mobileapps to 2f44362 (duration: 18m 46s) [21:39:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:39:44] (03PS1) 10Bstorm: sonofgridengine: also make the service enable-able [puppet] - 10https://gerrit.wikimedia.org/r/477909 (https://phabricator.wikimedia.org/T211055) [21:40:01] 10Operations, 10Research, 10Patch-For-Review, 10User-Banyek: Import recommendations into production database - https://phabricator.wikimedia.org/T208622 (10bmansurov) @Banyek [[ https://gerrit.wikimedia.org/r/#/admin/projects/research/article-recommender/deploy | here ]]'s the repository that I'm planning... [21:40:40] (03PS1) 10Papaul: DHCP: Add MAC address entries for logstash200[1-3] [puppet] - 10https://gerrit.wikimedia.org/r/477911 (https://phabricator.wikimedia.org/T211065) [21:41:18] !log mobileapps deployment failed for group default03, rolling back and retrying [21:41:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:41:36] !log mholloway-shell@deploy1001 Started deploy [mobileapps/deploy@243a503]: Update mobileapps to 2f44362 [21:41:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:43:57] 10Operations, 10Research, 10Patch-For-Review, 10User-Banyek: Import recommendations into production database - https://phabricator.wikimedia.org/T208622 (10Ottomata) Ah, nm, my question is answered on this ticket: https://phabricator.wikimedia.org/T203039#4574768 This db is in a misc MySQL instance. 
[21:44:24] !log mholloway-shell@deploy1001 Finished deploy [mobileapps/deploy@243a503]: Update mobileapps to 2f44362 (duration: 02m 47s) [21:44:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:46:15] 10Operations, 10Research, 10Patch-For-Review, 10User-Banyek: Import recommendations into production database - https://phabricator.wikimedia.org/T208622 (10Ottomata) Ok, @Banyek then the Q is: May I copy @bmansurov's data and import script over to a place that has access to m2-master:3306 (stat1007 does n... [21:47:06] for the record, the second try deploying mobileapps went fine [21:50:52] (03CR) 10Bstorm: [C: 032] sonofgridengine: also make the service enable-able [puppet] - 10https://gerrit.wikimedia.org/r/477909 (https://phabricator.wikimedia.org/T211055) (owner: 10Bstorm) [21:51:04] (03PS1) 10CDanis: more silly prompt tweaks [puppet] - 10https://gerrit.wikimedia.org/r/477914 [21:51:12] 10Operations, 10Wikimedia-Logstash: Procure and provision Logging pipeline hardware in multiple datacenters - https://phabricator.wikimedia.org/T205850 (10herron) [21:51:17] 10Operations, 10Wikimedia-Logstash, 10vm-requests, 10Patch-For-Review: Spin up 3 logstash/kibana frontend VMs in codfw - https://phabricator.wikimedia.org/T211217 (10herron) 05Open>03Resolved `logstash200[4-6]` have been built and are online with spare role. will update these to logstash role in the n... 
[21:52:24] (03PS2) 10CDanis: more silly prompt tweaks [puppet] - 10https://gerrit.wikimedia.org/r/477914 [21:53:06] (03CR) 10CDanis: [C: 032] more silly prompt tweaks [puppet] - 10https://gerrit.wikimedia.org/r/477914 (owner: 10CDanis) [22:02:53] (03PS2) 10Cmjohnson: Adding mgmt dns for ms-be1044-50 [dns] - 10https://gerrit.wikimedia.org/r/477829 (https://phabricator.wikimedia.org/T209618) [22:03:38] (03CR) 10Cmjohnson: [C: 032] Adding mgmt dns for ms-be1044-50 [dns] - 10https://gerrit.wikimedia.org/r/477829 (https://phabricator.wikimedia.org/T209618) (owner: 10Cmjohnson) [22:04:56] 10Operations, 10Research, 10Patch-For-Review, 10User-Banyek: Import recommendations into production database - https://phabricator.wikimedia.org/T208622 (10bmansurov) [22:05:06] 10Operations, 10ops-eqiad, 10media-storage, 10Patch-For-Review: rack/setup/install ms-be10[44-50].eqiad.wmnet - https://phabricator.wikimedia.org/T209618 (10Cmjohnson) [22:07:02] 10Operations, 10Wikimedia-Logstash, 10service-runner, 10Core Platform Team Backlog (Next), 10Services (next): Move service-runner to new logging infrastructure - https://phabricator.wikimedia.org/T211125 (10Pchelolo) Oh.... Apparently, apple have broken syslog in newer OSX versions, so I'm most certainly... 
[22:07:14] (03CR) 10CRusnov: Add an old hardware report (031 comment) [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/477716 (https://phabricator.wikimedia.org/T205899) (owner: 10CRusnov) [22:07:54] (03PS1) 10Bstorm: sonofgridengine: make shadowd enable-able as well [puppet] - 10https://gerrit.wikimedia.org/r/477919 (https://phabricator.wikimedia.org/T211055) [22:17:08] 10Operations: Create a mediawiki::cronjob define - https://phabricator.wikimedia.org/T211250 (10jijiki) p:05Triage>03Low [22:17:17] 10Operations, 10User-jijiki: Create a mediawiki::cronjob define - https://phabricator.wikimedia.org/T211250 (10jijiki) [22:18:17] 10Operations, 10SRE-Access-Requests, 10User-jijiki: Requesting access to `researchers` group for joewalsh - https://phabricator.wikimedia.org/T211115 (10jijiki) Pending SRE approval [22:19:14] (03PS1) 10Papaul: Partman: Add logstash200[1-3] [puppet] - 10https://gerrit.wikimedia.org/r/477923 (https://phabricator.wikimedia.org/T211065) [22:20:01] PROBLEM - pdfrender on scb1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:21:52] 10Operations, 10SRE-Access-Requests, 10User-jijiki: Requesting access to `researchers` group for joewalsh - https://phabricator.wikimedia.org/T211115 (10dr0ptp4kt) Approved. 
[22:23:58] RECOVERY - pdfrender on scb1004 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.002 second response time [22:26:33] (03CR) 10Bstorm: [C: 032] sonofgridengine: make shadowd enable-able as well [puppet] - 10https://gerrit.wikimedia.org/r/477919 (https://phabricator.wikimedia.org/T211055) (owner: 10Bstorm) [22:28:14] (03CR) 10Volans: Add an old hardware report (031 comment) [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/477716 (https://phabricator.wikimedia.org/T205899) (owner: 10CRusnov) [22:28:28] PROBLEM - pdfrender on scb1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:30:15] 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/install codfw logstash elasticsearch storage servers - https://phabricator.wikimedia.org/T211065 (10Papaul) [22:31:52] RECOVERY - pdfrender on scb1004 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 9.686 second response time [22:37:44] PROBLEM - pdfrender on scb1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:43:25] !log restarting pdfreder on scb* hosts in eqiad [22:43:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:43:37] (03PS3) 10CRusnov: Add reports deployment to netbox profile [puppet] - 10https://gerrit.wikimedia.org/r/477845 (https://phabricator.wikimedia.org/T205899) [22:44:10] (03CR) 10jerkins-bot: [V: 04-1] Add reports deployment to netbox profile [puppet] - 10https://gerrit.wikimedia.org/r/477845 (https://phabricator.wikimedia.org/T205899) (owner: 10CRusnov) [22:44:56] RECOVERY - pdfrender on scb1004 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.021 second response time [22:45:21] 10Operations, 10Traffic, 10netops: Free up 185.15.59.0/24 - https://phabricator.wikimedia.org/T211254 (10ayounsi) p:05Triage>03Low [22:47:41] (03PS1) 10Paladox: profile::phabricator::httpd: Fix worker configs and also use hiera value [puppet] - 10https://gerrit.wikimedia.org/r/477925 [22:49:40] (03PS2) 10Paladox: profile::phabricator::httpd: Fix worker 
configs and also use hiera value [puppet] - 10https://gerrit.wikimedia.org/r/477925 [22:53:46] (03PS4) 10CRusnov: Add reports deployment to netbox profile [puppet] - 10https://gerrit.wikimedia.org/r/477845 (https://phabricator.wikimedia.org/T205899) [22:54:35] (03CR) 10jerkins-bot: [V: 04-1] Add reports deployment to netbox profile [puppet] - 10https://gerrit.wikimedia.org/r/477845 (https://phabricator.wikimedia.org/T205899) (owner: 10CRusnov) [22:55:01] 10Operations, 10Citoid, 10Patch-For-Review, 10Services (done), 10VisualEditor (Current work): Transition citoid to use Zotero's translation-server-v2 - https://phabricator.wikimedia.org/T197242 (10danstillman) Re: latency, unfortunately somewhat slower performance is to be expected here, since v1 used Fi... [22:59:23] (03PS3) 10Paladox: profile::phabricator::httpd: Fix worker configs and also use hiera value [puppet] - 10https://gerrit.wikimedia.org/r/477925 [23:00:38] 10Operations, 10Citoid, 10Patch-For-Review, 10Services (done), 10VisualEditor (Current work): Transition citoid to use Zotero's translation-server-v2 - https://phabricator.wikimedia.org/T197242 (10ssastry) >>! In T197242#4801985, @danstillman wrote: > Re: latency, unfortunately somewhat slower performanc... 
[23:03:29] (03PS4) 10Paladox: profile::phabricator::httpd: Fix worker configs and also use hiera value [puppet] - 10https://gerrit.wikimedia.org/r/477925 [23:06:54] (03PS1) 10Bstorm: sonofgridengine: try setting config variables to tune the shadow master [puppet] - 10https://gerrit.wikimedia.org/r/477927 (https://phabricator.wikimedia.org/T211258) [23:09:01] (03CR) 10Bstorm: [C: 032] sonofgridengine: try setting config variables to tune the shadow master [puppet] - 10https://gerrit.wikimedia.org/r/477927 (https://phabricator.wikimedia.org/T211258) (owner: 10Bstorm) [23:10:25] (03PS5) 10CRusnov: Add reports deployment to netbox profile [puppet] - 10https://gerrit.wikimedia.org/r/477845 (https://phabricator.wikimedia.org/T205899) [23:11:19] (03CR) 10jerkins-bot: [V: 04-1] Add reports deployment to netbox profile [puppet] - 10https://gerrit.wikimedia.org/r/477845 (https://phabricator.wikimedia.org/T205899) (owner: 10CRusnov) [23:12:15] (03PS6) 10CRusnov: Add reports deployment to netbox profile [puppet] - 10https://gerrit.wikimedia.org/r/477845 (https://phabricator.wikimedia.org/T205899) [23:15:56] (03PS1) 10Bstorm: sonofgridengine: create the dir for the override conf [puppet] - 10https://gerrit.wikimedia.org/r/477929 (https://phabricator.wikimedia.org/T211258) [23:16:31] (03CR) 10jerkins-bot: [V: 04-1] sonofgridengine: create the dir for the override conf [puppet] - 10https://gerrit.wikimedia.org/r/477929 (https://phabricator.wikimedia.org/T211258) (owner: 10Bstorm) [23:19:52] (03PS2) 10Bstorm: sonofgridengine: create the dir for the override conf [puppet] - 10https://gerrit.wikimedia.org/r/477929 (https://phabricator.wikimedia.org/T211258) [23:20:25] (03CR) 10jerkins-bot: [V: 04-1] sonofgridengine: create the dir for the override conf [puppet] - 10https://gerrit.wikimedia.org/r/477929 (https://phabricator.wikimedia.org/T211258) (owner: 10Bstorm) [23:21:56] (03PS3) 10Bstorm: sonofgridengine: create the dir for the override conf [puppet] - 
10https://gerrit.wikimedia.org/r/477929 (https://phabricator.wikimedia.org/T211258) [23:22:59] (03CR) 10Bstorm: [C: 032] sonofgridengine: create the dir for the override conf [puppet] - 10https://gerrit.wikimedia.org/r/477929 (https://phabricator.wikimedia.org/T211258) (owner: 10Bstorm) [23:25:29] (03CR) 10CRusnov: "Test compile seems good:" [puppet] - 10https://gerrit.wikimedia.org/r/477845 (https://phabricator.wikimedia.org/T205899) (owner: 10CRusnov) [23:26:31] (03PS1) 10Bstorm: sonofgridengine: make the shadow master override read only [puppet] - 10https://gerrit.wikimedia.org/r/477935 (https://phabricator.wikimedia.org/T211258) [23:27:55] (03CR) 10Bstorm: [C: 032] sonofgridengine: make the shadow master override read only [puppet] - 10https://gerrit.wikimedia.org/r/477935 (https://phabricator.wikimedia.org/T211258) (owner: 10Bstorm) [23:31:39] (03PS1) 10Cwhite: prometheus: add directory size collector [puppet] - 10https://gerrit.wikimedia.org/r/477937 (https://phabricator.wikimedia.org/T211094) [23:32:12] (03CR) 10jerkins-bot: [V: 04-1] prometheus: add directory size collector [puppet] - 10https://gerrit.wikimedia.org/r/477937 (https://phabricator.wikimedia.org/T211094) (owner: 10Cwhite) [23:35:17] (03PS2) 10Cwhite: prometheus: add directory size collector [puppet] - 10https://gerrit.wikimedia.org/r/477937 (https://phabricator.wikimedia.org/T211094) [23:35:46] (03PS3) 10Cwhite: prometheus: add directory size collector [puppet] - 10https://gerrit.wikimedia.org/r/477937 (https://phabricator.wikimedia.org/T211094) [23:35:48] (03CR) 10jerkins-bot: [V: 04-1] prometheus: add directory size collector [puppet] - 10https://gerrit.wikimedia.org/r/477937 (https://phabricator.wikimedia.org/T211094) (owner: 10Cwhite) [23:42:02] (03CR) 10Cwhite: "Is there any reason to not merge this changeset?" 
[puppet] - 10https://gerrit.wikimedia.org/r/477620 (https://phabricator.wikimedia.org/T147326) (owner: 10Cwhite) [23:45:44] (03CR) 10CRusnov: Add an old hardware report (031 comment) [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/477716 (https://phabricator.wikimedia.org/T205899) (owner: 10CRusnov) [23:46:35] (03CR) 10CRusnov: "I know it's feature creep but perhaps this report should also include a test_old_inventory that would fail on INVENTORY items that are old" [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/477716 (https://phabricator.wikimedia.org/T205899) (owner: 10CRusnov) [23:53:21] (03CR) 10Faidon Liambotis: "This one is a bit tricky! Haardware age can be more nuanced than the "5 year" rule. For instance we tend to replace some equipment in cycl" [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/477716 (https://phabricator.wikimedia.org/T205899) (owner: 10CRusnov) [23:55:18] (03PS5) 10CRusnov: Add an old hardware report [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/477716 (https://phabricator.wikimedia.org/T205899) [23:56:36] (03CR) 10CRusnov: Add an old hardware report (031 comment) [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/477716 (https://phabricator.wikimedia.org/T205899) (owner: 10CRusnov) [23:57:18] RECOVERY - cassandra-b CQL 10.192.48.125:9042 on restbase2018 is OK: TCP OK - 0.036 second response time on 10.192.48.125 port 9042 [23:58:10] PROBLEM - restbase endpoints health on restbase1009 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) is CRITICAL: Test Retrieve aggregated feed content for April 29, 2016 responds with malformed body (AttributeError: NoneType object has no attribute get) [23:59:22] RECOVERY - restbase endpoints health on restbase1009 is OK: All endpoints are healthy