[01:05:17] (03PS1) 10Zoranzoki21: Enable RCPatrol on srwikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/472917 (https://phabricator.wikimedia.org/T209250) [01:11:38] (03PS1) 10Zoranzoki21: Enable autopatrol, patrol, rollback rights and RCPatrol on srwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/472918 (https://phabricator.wikimedia.org/T209252) [01:13:45] (03PS1) 10Zoranzoki21: Remove duplicates of comments about task T206935 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/472919 (https://phabricator.wikimedia.org/T206935) [03:33:06] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 894.50 seconds [03:36:58] Hi, I don't know what I should @cover here https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/OAuthAuthentication/+/472865/18/tests/phpunit/OAuthAuthExternalUserTest.php [03:39:29] (03PS8) 10Mathew.onipe: profile::maps::osm_master: change osmupdater and osmimporter auth method to peer [puppet] - 10https://gerrit.wikimedia.org/r/466574 (https://phabricator.wikimedia.org/T206639) [03:39:53] (03CR) 10jerkins-bot: [V: 04-1] profile::maps::osm_master: change osmupdater and osmimporter auth method to peer [puppet] - 10https://gerrit.wikimedia.org/r/466574 (https://phabricator.wikimedia.org/T206639) (owner: 10Mathew.onipe) [03:48:26] (03PS9) 10Mathew.onipe: profile::maps::osm_master: change osmupdater and osmimporter auth method to peer [puppet] - 10https://gerrit.wikimedia.org/r/466574 (https://phabricator.wikimedia.org/T206639) [03:49:21] (03CR) 10jerkins-bot: [V: 04-1] profile::maps::osm_master: change osmupdater and osmimporter auth method to peer [puppet] - 10https://gerrit.wikimedia.org/r/466574 (https://phabricator.wikimedia.org/T206639) (owner: 10Mathew.onipe) [03:53:36] (03PS10) 10Mathew.onipe: profile::maps::osm_master: change osmupdater and osmimporter auth method to peer [puppet] - 10https://gerrit.wikimedia.org/r/466574 (https://phabricator.wikimedia.org/T206639) 
[04:04:28] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Search (Current work): Refactor wdqs::gui - Separate cron tasks from the module - https://phabricator.wikimedia.org/T209257 (10Mathew.onipe) p:05Triage>03Normal [04:08:06] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 144.72 seconds [04:50:38] 10Operations, 10DBA: Predictive failures on disk S.M.A.R.T. status - https://phabricator.wikimedia.org/T208323 (10Marostegui) Once the disk has failed we will get an automatic ticket for getting that disk replaced. I don't think we need this tracking task. [05:34:03] 10Operations, 10ops-codfw, 10DC-Ops: Degraded RAID on db2049 - https://phabricator.wikimedia.org/T208838 (10Marostegui) this is finished ` root@db2049:~# hpssacli controller all show config Smart Array P420i in Slot 0 (Embedded) (sn: 0014380337DD260) Port Name: 1I Port Name: 2I Gen8 ServBP 1... [05:36:44] 10Operations, 10DBA: db2061 has predictive disk errors - https://phabricator.wikimedia.org/T208957 (10Marostegui) I would ignore this until the disks fail and we get the automatic failed disk task created [05:39:06] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review, 10User-Banyek: rack/setup/install pc2007-pc2010 - https://phabricator.wikimedia.org/T207259 (10Marostegui) 05Open>03Resolved I agree with Jaime, a bigger stripe size shouldn't be an issue. Plus, we will be having SSDs, which will probably compensa... [05:39:42] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review, 10User-Banyek: rack/setup/install pc2007-pc2010 - https://phabricator.wikimedia.org/T207259 (10Marostegui) [05:54:10] (03CR) 10Marostegui: "Nice work! 
One comment to make things a bit clearer" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/470851 (https://phabricator.wikimedia.org/T208383) (owner: 10Banyek) [06:20:02] 10Operations, 10ops-codfw: Degraded RAID on db2048 - https://phabricator.wikimedia.org/T208141 (10Marostegui) Better to normally wait to close the task until the raid is back to optimal - see the explanation I already gave at: T207212#4677602 [06:28:22] PROBLEM - puppet last run on labsdb1006 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/puppet-enabled] [06:28:22] PROBLEM - puppet last run on mw1278 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/profile.d/mysql-ps1.sh] [06:29:32] PROBLEM - puppet last run on analytics1048 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/gen_fingerprints] [06:31:21] PROBLEM - puppet last run on cp1080 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/puppet-enabled] [06:32:07] 10Operations, 10ops-eqiad, 10DBA: db1117 went away - https://phabricator.wikimedia.org/T208150 (10Marostegui) 05Open>03Resolved a:03Cmjohnson I see no more errors on the idrac logs since the reboot. Let's close this and re-open if this happens again and then we'll need to get the vendor involved. ` E... [06:34:31] 10Operations, 10DBA: BBU Fail on dbstore2002 - https://phabricator.wikimedia.org/T208320 (10Marostegui) a:05Papaul>03None I guess this is the reason why dbstore2002:3313 is lagging so much behind? (It could be something else, I am just catching up on emails) Should we ease a bit replication options to make... 
[06:53:21] PROBLEM - Nginx local proxy to apache on mw1285 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.010 second response time [06:54:22] RECOVERY - Nginx local proxy to apache on mw1285 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.064 second response time [06:56:52] RECOVERY - puppet last run on cp1080 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:59:01] RECOVERY - puppet last run on labsdb1006 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [06:59:01] RECOVERY - puppet last run on mw1278 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [07:00:12] RECOVERY - puppet last run on analytics1048 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [07:04:40] (03CR) 10Elukey: [C: 032] Relase new upstream version [debs/prometheus-mcrouter-exporter] (debian) - 10https://gerrit.wikimedia.org/r/472628 (https://phabricator.wikimedia.org/T208375) (owner: 10Elukey) [07:15:06] AndyRussG: o/ - whenever you have time can you review T203669 ? 
[07:15:07] T203669: Return to real time banner impressions in Druid - https://phabricator.wikimedia.org/T203669 [07:20:17] 10Operations: Integrate Stretch 9.6 point update - https://phabricator.wikimedia.org/T209260 (10MoritzMuehlenhoff) [07:20:24] 10Operations: Integrate Stretch 9.6 point update - https://phabricator.wikimedia.org/T209260 (10MoritzMuehlenhoff) p:05Triage>03Normal [07:28:30] (03PS1) 10Elukey: mcrouter: fix initial probe timeout command line option [puppet] - 10https://gerrit.wikimedia.org/r/472926 (https://phabricator.wikimedia.org/T203786) [07:28:47] _joe_ ---^ this is me doing copy/paste like a boss [07:29:29] <_joe_> lemme see [07:29:41] <_joe_> ahahahah [07:29:44] yeah [07:29:49] (03CR) 10Giuseppe Lavagetto: [C: 031] mcrouter: fix initial probe timeout command line option [puppet] - 10https://gerrit.wikimedia.org/r/472926 (https://phabricator.wikimedia.org/T203786) (owner: 10Elukey) [07:30:10] I was testing the new prometheus exporter and I saw mcrouter complaining about it after the restart [07:31:12] <_joe_> lol [07:31:24] (03CR) 10Elukey: [C: 032] mcrouter: fix initial probe timeout command line option [puppet] - 10https://gerrit.wikimedia.org/r/472926 (https://phabricator.wikimedia.org/T203786) (owner: 10Elukey) [07:53:38] (03PS1) 10Elukey: profile::prometheus::mcrouter_exporter: alway set listen address [puppet] - 10https://gerrit.wikimedia.org/r/472936 (https://phabricator.wikimedia.org/T208375) [07:58:17] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler1002/13429/mw2204.codfw.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/472936 (https://phabricator.wikimedia.org/T208375) (owner: 10Elukey) [08:07:58] 10Operations, 10DBA: Predictive failures on disk S.M.A.R.T. status - https://phabricator.wikimedia.org/T208323 (10Marostegui) Already caught up with Jaime about why this ticket exists. 
All good here [08:08:26] (03PS2) 10Elukey: aptrepo: update cloudera cdh to 5.15 [puppet] - 10https://gerrit.wikimedia.org/r/472410 (https://phabricator.wikimedia.org/T204759) [08:09:28] (03CR) 10Elukey: [C: 032] aptrepo: update cloudera cdh to 5.15 [puppet] - 10https://gerrit.wikimedia.org/r/472410 (https://phabricator.wikimedia.org/T204759) (owner: 10Elukey) [08:12:47] (03PS2) 10Filippo Giunchedi: swift: disable free inode btree at mkfs time [puppet] - 10https://gerrit.wikimedia.org/r/472415 (https://phabricator.wikimedia.org/T199198) [08:12:53] (03CR) 10Filippo Giunchedi: [C: 032] swift: disable free inode btree at mkfs time [puppet] - 10https://gerrit.wikimedia.org/r/472415 (https://phabricator.wikimedia.org/T199198) (owner: 10Filippo Giunchedi) [08:16:06] 10Operations, 10media-storage, 10Patch-For-Review, 10User-fgiunchedi: Some swift filesystems reporting negative disk usage - https://phabricator.wikimedia.org/T199198 (10fgiunchedi) 05stalled>03Resolved This is completed now! New xfs filesystems provisioned by puppet will also contain the right flags t... 
[08:16:12] (03Abandoned) 10Filippo Giunchedi: WIP swift-reformat-device [puppet] - 10https://gerrit.wikimedia.org/r/472414 (owner: 10Filippo Giunchedi) [08:16:39] (03CR) 10Filippo Giunchedi: [C: 032] hieradata: rollout rsyslog_exporter in codfw [puppet] - 10https://gerrit.wikimedia.org/r/472396 (https://phabricator.wikimedia.org/T205849) (owner: 10Filippo Giunchedi) [08:16:48] (03PS4) 10Filippo Giunchedi: hieradata: rollout rsyslog_exporter in codfw [puppet] - 10https://gerrit.wikimedia.org/r/472396 (https://phabricator.wikimedia.org/T205849) [08:23:00] !log temporarily disable puppet in codfw before enabling rsyslog_exporter [08:23:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:26:37] <_joe_> !log uploading new php-{luasandbox,wikidiff2} to stretch main component, rebuild php-{luasandbox,wikidiff2,geoip,msgpack} for php 7.2, upload to stretch component php72, T208433 [08:26:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:26:39] T208433: Package and install php 7.2 in place of php 7.0 - https://phabricator.wikimedia.org/T208433 [08:27:52] 10Operations, 10ops-codfw, 10DBA, 10User-Banyek: db2042 (m3) master RAID battery failed - https://phabricator.wikimedia.org/T202051 (10Marostegui) This failed again and I have created T202051 to track it [08:29:10] (03PS1) 10Muehlenhoff: Update tor archive key [puppet] - 10https://gerrit.wikimedia.org/r/472939 [08:30:52] (03CR) 10Elukey: [C: 031] Update tor archive key [puppet] - 10https://gerrit.wikimedia.org/r/472939 (owner: 10Muehlenhoff) [08:33:10] (03PS2) 10Muehlenhoff: Update tor archive key [puppet] - 10https://gerrit.wikimedia.org/r/472939 [08:33:54] (03PS1) 10Filippo Giunchedi: site: fix role() indentation [puppet] - 10https://gerrit.wikimedia.org/r/472940 [08:33:56] (03PS1) 10Filippo Giunchedi: hieradata: remove rsyslog_exporter rollout from regex.yaml [puppet] - 10https://gerrit.wikimedia.org/r/472941 [08:35:27] (03CR) 10Muehlenhoff: [C: 032] Update 
tor archive key [puppet] - 10https://gerrit.wikimedia.org/r/472939 (owner: 10Muehlenhoff) [08:41:24] !log Change sync_binlog to 0 and trx_commit to 2 on dbstore2002:3313 to let it catch up [08:41:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:43:00] 10Operations, 10DBA: BBU Fail on dbstore2002 - https://phabricator.wikimedia.org/T208320 (10Marostegui) >>! In T208320#4738624, @Marostegui wrote: > I guess this is the reason why dbstore2002:3313 is lagging so much behind? (It could be something else, I am just catching up on emails) > Should we ease a bit re... [08:46:25] (03PS1) 10Marostegui: db-eqiad.php: Depool db1099:3318 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/472944 (https://phabricator.wikimedia.org/T203709) [08:49:13] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1099:3318 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/472944 (https://phabricator.wikimedia.org/T203709) (owner: 10Marostegui) [08:50:30] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1099:3318 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/472944 (https://phabricator.wikimedia.org/T203709) (owner: 10Marostegui) [08:52:12] (03CR) 10Urbanecm: [C: 031] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/471345 (https://phabricator.wikimedia.org/T207544) (owner: 10Arcayn) [08:52:17] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1099:3318 (duration: 01m 01s) [08:52:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:53:46] !log Deploy schema change on db1099:3318 - T203709 [08:53:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:53:49] T203709: Schema change for adding indexes of ct_tag_id - https://phabricator.wikimedia.org/T203709 [08:58:55] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1099:3318 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/472944 (https://phabricator.wikimedia.org/T203709) (owner: 10Marostegui) 
[09:04:59] (03CR) 10Hashar: "The reviewers and reviewers-by-blame looks great (doc: https://gerrit.googlesource.com/plugins/reviewers/+/master/src/main/resources/Docum" [software/gerrit] (deploy/wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/472003 (https://phabricator.wikimedia.org/T205784) (owner: 10Thcipriani) [09:08:25] (03PS1) 10Muehlenhoff: Revert "Update tor archive key" [puppet] - 10https://gerrit.wikimedia.org/r/472946 [09:09:30] (03CR) 10Muehlenhoff: [C: 032] Revert "Update tor archive key" [puppet] - 10https://gerrit.wikimedia.org/r/472946 (owner: 10Muehlenhoff) [09:11:50] (03PS2) 10Elukey: profile::prometheus::mcrouter_exporter: alway set listen address [puppet] - 10https://gerrit.wikimedia.org/r/472936 (https://phabricator.wikimedia.org/T208375) [09:12:41] !log Deploy schema change on db2048 (s1 codfw master) (replication will be stopped) - T67448 [09:12:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:12:44] T67448: Dropping rc_cur_time on wmf databases - https://phabricator.wikimedia.org/T67448 [09:16:08] (03PS1) 10Filippo Giunchedi: hieradata: move enable_rsyslog_exporter into hieradata/profile [puppet] - 10https://gerrit.wikimedia.org/r/472948 (https://phabricator.wikimedia.org/T205849) [09:17:23] (03CR) 10Arturo Borrero Gonzalez: cloudvps: eqiad1: add cloudinstances2b virtual router FQDNs (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/460320 (https://phabricator.wikimedia.org/T202886) (owner: 10Arturo Borrero Gonzalez) [09:18:17] (03PS20) 10Urbanecm: [tests] Ensure only existing wikis are referenced from IS.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/467448 (https://phabricator.wikimedia.org/T115138) [09:25:41] (03CR) 10Filippo Giunchedi: [C: 032] "PCC https://puppet-compiler.wmflabs.org/compiler1002/13433/" [puppet] - 10https://gerrit.wikimedia.org/r/472948 (https://phabricator.wikimedia.org/T205849) (owner: 10Filippo Giunchedi) [09:25:50] (03PS2) 10Filippo Giunchedi: hieradata: move 
enable_rsyslog_exporter into hieradata/profile [puppet] - 10https://gerrit.wikimedia.org/r/472948 (https://phabricator.wikimedia.org/T205849) [09:27:23] (03PS3) 10Filippo Giunchedi: hieradata: enable rsyslog_exporter in ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/472948 (https://phabricator.wikimedia.org/T205849) [09:31:48] (03PS4) 10Filippo Giunchedi: hieradata: enable rsyslog_exporter in ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/472948 (https://phabricator.wikimedia.org/T205849) [09:32:46] (03CR) 10Filippo Giunchedi: [C: 032] hieradata: enable rsyslog_exporter in ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/472948 (https://phabricator.wikimedia.org/T205849) (owner: 10Filippo Giunchedi) [09:33:33] 10Operations, 10DBA: BBU Fail on dbstore2002 - https://phabricator.wikimedia.org/T208320 (10Banyek) s3 is the only section there which is not compressed. Btw. We can check if the BBU causes it, because if we enable write caching we can see the results. [09:38:21] PROBLEM - MariaDB Slave Lag: s1 on db2094 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 704.53 seconds [09:38:31] ^ that is me [09:42:09] 10Puppet: Validate no namespaced keys are present in hieradata/*.yaml - https://phabricator.wikimedia.org/T209265 (10fgiunchedi) [09:42:36] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to deployment for WMDE-leszek - https://phabricator.wikimedia.org/T208717 (10ArielGlenn) [09:43:28] I got no page (codfw though so perhaps expected) [09:46:25] (03CR) 10Filippo Giunchedi: [C: 032] site: fix role() indentation [puppet] - 10https://gerrit.wikimedia.org/r/472940 (owner: 10Filippo Giunchedi) [09:46:42] (03PS2) 10Filippo Giunchedi: site: fix role() indentation [puppet] - 10https://gerrit.wikimedia.org/r/472940 [09:47:45] (03CR) 10Filippo Giunchedi: [C: 032] hieradata: remove rsyslog_exporter rollout from regex.yaml [puppet] - 10https://gerrit.wikimedia.org/r/472941 (owner: 10Filippo Giunchedi) [09:47:54] (03PS2) 10Filippo 
Giunchedi: hieradata: remove rsyslog_exporter rollout from regex.yaml [puppet] - 10https://gerrit.wikimedia.org/r/472941 [09:48:50] (03CR) 10ArielGlenn: "The uid in ldap for the account wmde-leszek is 12300; can you say where the 12300 number came from?" [puppet] - 10https://gerrit.wikimedia.org/r/472052 (https://phabricator.wikimedia.org/T208717) (owner: 10Cwhite) [09:49:45] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to deployment for WMDE-leszek - https://phabricator.wikimedia.org/T208717 (10ArielGlenn) I'm unsure about the uid in the patchset and have pinged @cwhite for an update. [09:53:24] 10Operations, 10LDAP-Access-Requests, 10SRE-Access-Requests: Requesting access to netbox for bd808 - https://phabricator.wikimedia.org/T208267 (10ArielGlenn) What's next on this, @MoritzMuehlenhoff ? [09:55:05] (03PS1) 10Muehlenhoff: Remove obsolete cloudera-trusty repository [puppet] - 10https://gerrit.wikimedia.org/r/472950 [09:55:12] RECOVERY - MariaDB Slave Lag: s1 on db2094 is OK: OK slave_sql_lag Replication lag: 0.51 seconds [09:55:17] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review, 10User-jijiki: Requesting access to deployment, operational logs, and analytics cluster for jlinehan - https://phabricator.wikimedia.org/T207951 (10ArielGlenn) @jlinehan please let us know that ssh is working for you; then we can close this ticket. T... 
[09:57:56] !log upgraded cdh packages (cdh 5.10 -> 5.15) for thirdparty/cloudera in jessie/stretch-wikimedia [09:57:56] 10Operations, 10SRE-Access-Requests: Requesting access to Jupyter notebook / analytics-privatedata-users for jgleeson - https://phabricator.wikimedia.org/T208432 (10ArielGlenn) [09:57:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:05:39] 10Operations, 10SRE-Access-Requests: Add Mukunda to releasers-mediawiki - https://phabricator.wikimedia.org/T209176 (10ArielGlenn) [10:10:40] (03PS1) 10ArielGlenn: add twentyafterfour to releasers-mediawiki [puppet] - 10https://gerrit.wikimedia.org/r/472951 (https://phabricator.wikimedia.org/T209176) [10:10:57] (03PS3) 10Elukey: profile::prometheus::mcrouter_exporter: alway set listen address [puppet] - 10https://gerrit.wikimedia.org/r/472936 (https://phabricator.wikimedia.org/T208375) [10:11:08] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Add Mukunda to releasers-mediawiki - https://phabricator.wikimedia.org/T209176 (10ArielGlenn) [10:13:31] (03CR) 10Filippo Giunchedi: [C: 031] profile::prometheus::mcrouter_exporter: alway set listen address [puppet] - 10https://gerrit.wikimedia.org/r/472936 (https://phabricator.wikimedia.org/T208375) (owner: 10Elukey) [10:13:47] (03CR) 10Elukey: [C: 032] profile::prometheus::mcrouter_exporter: alway set listen address [puppet] - 10https://gerrit.wikimedia.org/r/472936 (https://phabricator.wikimedia.org/T208375) (owner: 10Elukey) [10:15:46] 10Operations, 10SRE-Access-Requests, 10netops, 10Patch-For-Review: Access to network devices for Riccardo (volans) - https://phabricator.wikimedia.org/T208726 (10ArielGlenn) @ayounsi What is needed for this to move ahead? 
[10:16:46] (03PS1) 10Filippo Giunchedi: hieradata: add a note about namespaced keys [puppet] - 10https://gerrit.wikimedia.org/r/472952 (https://phabricator.wikimedia.org/T209265) [10:21:04] 10Operations, 10SRE-Access-Requests, 10netops, 10Patch-For-Review: Access to network devices for Riccardo (volans) - https://phabricator.wikimedia.org/T208726 (10ayounsi) I'll take care of it, I already gave him access to one device we use for tests, and need to find time to push his access everywhere. Thi... [10:22:06] (03PS2) 10Filippo Giunchedi: hieradata: add a note about namespaced keys [puppet] - 10https://gerrit.wikimedia.org/r/472952 (https://phabricator.wikimedia.org/T209265) [10:23:58] (03CR) 10Giuseppe Lavagetto: [C: 031] hieradata: add a note about namespaced keys [puppet] - 10https://gerrit.wikimedia.org/r/472952 (https://phabricator.wikimedia.org/T209265) (owner: 10Filippo Giunchedi) [10:25:18] hey all! i'm going to pull https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/472596/ onto mwmaint1002 and start running it [10:26:04] per marostegui's advice, i'm going to run it against testwiki and a couple of small wikis first [10:26:59] since it got merged on friday, should i scap pull on mwmaint1002? [10:27:18] phuedx: make sure to !log when you start it :) [10:27:32] ack [10:27:47] thank you [10:29:22] (03CR) 10Filippo Giunchedi: [C: 032] "CC'd people that have namespaced keys in hieradata/common.yaml" [puppet] - 10https://gerrit.wikimedia.org/r/472952 (https://phabricator.wikimedia.org/T209265) (owner: 10Filippo Giunchedi) [10:30:06] jan_drewniak: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Wikimedia Portals Update. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181112T1030). 
[10:32:01] !log upload mcrouter exporter 0.0.0+git20181106 to stretch-wikimedia [10:32:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:32:15] 10Operations, 10Icinga, 10monitoring, 10Patch-For-Review: re-create script for manual paging - https://phabricator.wikimedia.org/T82937 (10aborrero) Hi, I've been trying to use this script today. Here is my experience: ` aborrero@einsteinium:~ $ sudo /usr/local/bin/icinga-sms -l [...] # list of people, b... [10:37:11] 10Operations, 10SRE-Access-Requests, 10WMDE-Analytics-Engineering, 10Graphite: Requesting access to graphite hosts for addshore - https://phabricator.wikimedia.org/T208750 (10ArielGlenn) [10:37:49] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1099:3318" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/472954 [10:38:34] any recommendations for how to get a maint script from master onto mwmaint1002 cleanly (i.e. not copying manually) [10:38:50] (03CR) 10DCausse: elasticsearch_cluster: multi-cluster/multi-instance support (039 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/468558 (https://phabricator.wikimedia.org/T207918) (owner: 10Mathew.onipe) [10:39:49] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1099:3318" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/472954 (owner: 10Marostegui) [10:41:54] 10Operations, 10SRE-Access-Requests, 10WMDE-Analytics-Engineering, 10Graphite: Requesting access to graphite hosts for addshore - https://phabricator.wikimedia.org/T208750 (10ArielGlenn) @Addshore It looks like you need write access to /var/lib/carbon/whisper on graphite1001 and 2001, which means being abl... 
[10:42:14] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1099:3318" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/472954 (owner: 10Marostegui) [10:43:09] (03CR) 10Filippo Giunchedi: [C: 04-1] "> Patch Set 1:" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/472694 (https://phabricator.wikimedia.org/T206454) (owner: 10Herron) [10:43:30] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1099:3318 (duration: 00m 53s) [10:43:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:45:07] hashar, zeljkof: not sure if you're around: any recommendations for how to get a maint script from master onto mwmaint1002 cleanly (i.e. not copying manually)? [10:45:18] !log Deploy schema change on db2048 (s1 codfw master), this will generate lag on s1 codfw - T51191 [10:45:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:45:20] T51191: Dropping rc_moved_to_title/rc_moved_to_ns on wmf databases - https://phabricator.wikimedia.org/T51191 [10:47:28] (03PS1) 10Marostegui: db-eqiad.php: Depool db1101:3318 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/472955 (https://phabricator.wikimedia.org/T203709) [10:48:19] !log upload mtail 3.0.0~rc5-1~bpo9+1wmf1 to stretch-wikimedia [10:48:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:48:53] addshore: you're listed as a deployer for the next window too so you don't escape the pings! 
[10:49:14] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1101:3318 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/472955 (https://phabricator.wikimedia.org/T203709) (owner: 10Marostegui) [10:49:35] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 231, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:50:42] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1101:3318 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/472955 (https://phabricator.wikimedia.org/T203709) (owner: 10Marostegui) [10:51:56] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1101:3318 (duration: 00m 55s) [10:51:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:53:58] backporting it to the current deployment branch and updating the deployment host seems like the best way to go [10:54:35] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1099:3318" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/472954 (owner: 10Marostegui) [10:54:39] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1101:3318 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/472955 (https://phabricator.wikimedia.org/T203709) (owner: 10Marostegui) [10:56:38] 10Operations, 10SRE-Access-Requests, 10WMDE-Analytics-Engineering, 10Graphite: Requesting access to graphite hosts for addshore - https://phabricator.wikimedia.org/T208750 (10Addshore) As far as I remember from the last time I discussed this with someone there are a variety of scripts that allow you to per... 
[10:56:50] 10Operations, 10SRE-Access-Requests, 10WMDE-Analytics-Engineering, 10Graphite, 10User-Addshore: Requesting access to graphite hosts for addshore - https://phabricator.wikimedia.org/T208750 (10Addshore) [11:01:04] 10Operations, 10monitoring, 10Patch-For-Review: Provision >= 50% of statsd/Graphite-only metrics in Prometheus - https://phabricator.wikimedia.org/T205870 (10fgiunchedi) [11:03:48] phuedx: I'm around [11:03:53] zeljkof: o/ [11:04:07] (03PS1) 10ArielGlenn: add mmarble as ldap user [puppet] - 10https://gerrit.wikimedia.org/r/472957 (https://phabricator.wikimedia.org/T208431) [11:04:20] I'm not sure I understood you, what do you want to do? [11:04:36] i've got a maint script in mw master that i'd like to get onto mwmaint1002 cleanly (hopefully just a scap pull on that server) [11:04:41] https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/472596/ [11:04:50] and this is pretty much all I know about swat :D [11:04:51] https://wikitech.wikimedia.org/wiki/SWAT_deploys/Deployers [11:05:58] should i backport it to 1.33.0-wmf.3 and update the deployment server? should i wait for the deployment window? [11:06:17] zeljkof: your documentation is greatly appreciated. trust me <3 [11:06:42] (03CR) 10ArielGlenn: [C: 032] add mmarble as ldap user [puppet] - 10https://gerrit.wikimedia.org/r/472957 (https://phabricator.wikimedia.org/T208431) (owner: 10ArielGlenn) [11:06:52] phuedx: is it urgent, or can it wait until SWAT? [11:07:04] i guess swat is in 50 minutes, right> [11:07:06] *? [11:07:12] or did you think about a separate deploy window? [11:07:24] yes, swat is in less than an hour [11:07:44] alright. i'll cherry-pick the change and add it to the SWAT queue [11:08:07] I've never deployed a script, I think, hashar do you have any recommendations? 
^ [11:09:28] as long as it's on the deployment host, then it can be pulled onto the maintenance server [11:09:39] i'm on the hook for running it :) [11:10:18] i'm unsure of the protocol here and i don't want to disrupt deployments [11:10:36] (by, y'know, updating deployment hosts outside of deployment windows etc) [11:12:31] (03PS1) 10ArielGlenn: Add kchapman as ldap only user [puppet] - 10https://gerrit.wikimedia.org/r/472959 (https://phabricator.wikimedia.org/T208949) [11:13:27] (03CR) 10ArielGlenn: [C: 032] Add kchapman as ldap only user [puppet] - 10https://gerrit.wikimedia.org/r/472959 (https://phabricator.wikimedia.org/T208949) (owner: 10ArielGlenn) [11:17:37] (03PS1) 10ArielGlenn: add kzimmerman as ldap only user [puppet] - 10https://gerrit.wikimedia.org/r/472961 (https://phabricator.wikimedia.org/T208822) [11:18:37] (03CR) 10ArielGlenn: [C: 032] add kzimmerman as ldap only user [puppet] - 10https://gerrit.wikimedia.org/r/472961 (https://phabricator.wikimedia.org/T208822) (owner: 10ArielGlenn) [11:26:20] !log installing Java security updates on elastic* [11:26:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:31:19] (03PS1) 10Joal: Update AQS druid datasource to 2018-10 [puppet] - 10https://gerrit.wikimedia.org/r/472963 [11:37:09] !log contint1001 : cleaning disk | T209123 ? 
[11:37:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:37:12] T209123: contint1001 build cleanup - https://phabricator.wikimedia.org/T209123 [11:37:25] 10Operations, 10Continuous-Integration-Infrastructure (shipyard), 10Kubernetes: set a harbor registry for testing - https://phabricator.wikimedia.org/T209271 (10fselles) p:05Triage>03Normal [11:38:13] 10Operations, 10Continuous-Integration-Infrastructure (shipyard), 10Kubernetes: set a harbor registry for testing - https://phabricator.wikimedia.org/T209271 (10fselles) a:03fselles [11:38:22] (03PS1) 10Effie Mouzeli: role::codfw::scb: switch rdb2001:6382 with rdb2003:6379 [puppet] - 10https://gerrit.wikimedia.org/r/472964 (https://phabricator.wikimedia.org/T206450) [11:40:05] 10Operations, 10Patch-For-Review, 10User-Joe, 10User-jijiki: Reorganize our redis rdb1/rdb2 clusters - https://phabricator.wikimedia.org/T206450 (10jijiki) [11:43:05] (03CR) 10jerkins-bot: [V: 04-1] role::codfw::scb: switch rdb2001:6382 with rdb2003:6379 [puppet] - 10https://gerrit.wikimedia.org/r/472964 (https://phabricator.wikimedia.org/T206450) (owner: 10Effie Mouzeli) [11:47:33] (03PS2) 10Effie Mouzeli: role::codfw::scb: switch rdb2001:6382 with rdb2003:6379 [puppet] - 10https://gerrit.wikimedia.org/r/472964 (https://phabricator.wikimedia.org/T206450) [11:49:09] !log updating puppet CI job for mtail upgrade https://gerrit.wikimedia.org/r/#/c/integration/config/+/472962/ [11:49:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:56:12] (03CR) 10Effie Mouzeli: "https://puppet-compiler.wmflabs.org/compiler1002/13435/scb2004.codfw.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/472964 (https://phabricator.wikimedia.org/T206450) (owner: 10Effie Mouzeli) [12:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: It is that lovely time of the day again! 
You are hereby commanded to deploy European Mid-day SWAT(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181112T1200). [12:00:04] Zoranzoki21 and phuedx: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [12:00:16] !log uploaded jenkins 2.138.3 to apt.wikimedia.org (jessie and stretch) [12:00:17] o/ [12:00:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:00:30] I can SWAT today [12:00:45] Zoranzoki21: around for swat? [12:01:54] phuedx: do you want to deploy your patch yourself? [12:02:44] zeljkof: not at my main machine right now [12:02:46] ^ hashar [12:02:56] i can be in 10 minutes -- but my keys aren't on this machine [12:03:24] phuedx: there's no rush, if you want to deploy in 10 minutes [12:03:31] or you can't deploy at all? [12:03:38] I'm not sure I've understood you :) [12:03:41] (03CR) 10Effie Mouzeli: [C: 032] role::codfw::scb: switch rdb2001:6382 with rdb2003:6379 [puppet] - 10https://gerrit.wikimedia.org/r/472964 (https://phabricator.wikimedia.org/T206450) (owner: 10Effie Mouzeli) [12:03:56] (03PS3) 10Effie Mouzeli: role::codfw::scb: switch rdb2001:6382 with rdb2003:6379 [puppet] - 10https://gerrit.wikimedia.org/r/472964 (https://phabricator.wikimedia.org/T206450) [12:03:58] i can deploy it but i'll be a little delayed as i'm on my little laptop right now [12:05:03] phuedx: I prefer when deployers deploy themselves :) but I can deploy if you prefer [12:07:11] zeljkof: on my main machine now [12:07:21] phuedx: go ahead then [12:07:35] ack [12:10:30] * phuedx waits on jenkins [12:15:16] !log Restarting nutcracker on scb200[1-6] - T206450 [12:15:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:15:20] T206450: Reorganize our redis rdb1/rdb2 clusters - https://phabricator.wikimedia.org/T206450 [12:15:37] 
zeljkof, hashar: https://integration.wikimedia.org/ci/job/mediawiki-quibble-composer-mysql-php70-docker/9376/ failed with the following message: [12:15:41] 12:09:31 npm WARN deprecated circular-json@0.3.3: CircularJSON is in maintenance only, flatted is its successor. [12:15:41] 12:09:38 npm WARN deprecated nomnom@1.8.1: Package no longer supported. Contact support@npmjs.com for more info. [12:15:49] (not related to the change) [12:17:04] phuedx: and it failed the build? [12:17:08] * zeljkof is looking [12:17:18] not sure yet, there's ~4 minutes left [12:18:09] I think this is the error [12:18:19] `npm ERR! shasum check failed for /tmp/npm-441-ae4be000/registry.npmjs.org/JSV/-/JSV-4.0.2.tgz` [12:18:26] probably a problem with chache [12:18:28] cache [12:18:46] just re-running the job should fix it, I think, hashar would know more [12:18:52] (03PS1) 10ArielGlenn: Fix up the ldap only entry for kzimmerman [puppet] - 10https://gerrit.wikimedia.org/r/472968 (https://phabricator.wikimedia.org/T208822) [12:19:58] (03CR) 10ArielGlenn: [C: 032] Fix up the ldap only entry for kzimmerman [puppet] - 10https://gerrit.wikimedia.org/r/472968 (https://phabricator.wikimedia.org/T208822) (owner: 10ArielGlenn) [12:20:39] ugh. it's a ~12 minutes build time. i might just c+2 v+2 [12:20:52] ~14 even [12:25:15] * phuedx twiddles fingers [12:25:18] more like ~20 ;) [12:25:35] phuedx: you can remove +2 from 472958 (set back to 0), then +2 again [12:25:46] it will take another 20 minutes or so :/ [12:26:00] but the job _should_ pass the second time, it looks like CI problem [12:27:24] hrm. 
looks like zuul hasn't picked up the removal and then +2 [12:29:06] not even with a c -2 [12:30:19] I'm not sure why it didn't cancel the jobs, I can stop them manually [12:31:23] one of my jobs timed out on jenkins as well, the first time [12:31:42] dunno if it was random, I resubmitted it and it went through [12:32:26] (03PS2) 10Elukey: Update AQS druid datasource to 2018-10 [puppet] - 10https://gerrit.wikimedia.org/r/472963 (owner: 10Joal) [12:32:33] phuedx: try now with +2 [12:32:38] thanks, zeljkof. trying again [12:36:54] phuedx: argh, again `npm ERR! shasum check failed for /tmp/npm-556-e89a88a6/registry.npmjs.org/mwbot/-/mwbot-1.0.10.tgz` [12:37:02] https://integration.wikimedia.org/ci/job/mediawiki-quibble-vendor-mysql-hhvm-docker/9630/console [12:37:07] zeljkof: i see it [12:37:12] I'm not sure what's wrong with CI :/ [12:37:21] :( [12:37:22] hashar: are you around to help with CI problems? [12:37:47] the last time it was a different package [12:37:52] (03CR) 10Elukey: [C: 032] Update AQS druid datasource to 2018-10 [puppet] - 10https://gerrit.wikimedia.org/r/472963 (owner: 10Joal) [12:38:05] I guess one or more containers has some kind of npm package cache problem [12:38:25] zeljkof: can you abort the job.
i'd prefer to get this change out soon so that i can start running the script [12:38:54] phuedx: sure, remove your vote in gerrit, I'll kill the jobs [12:39:00] let me +2 it now [12:39:09] so I can more quickly react if it fails again [12:39:09] done [12:41:44] +2ed, monitoring zuul [12:45:06] (03PS1) 10Effie Mouzeli: install_server: reimage rdb2001, rdb2002 to stretch [puppet] - 10https://gerrit.wikimedia.org/r/472970 (https://phabricator.wikimedia.org/T206450) [12:45:23] (03PS1) 10Ema: ATS: allow specifying remap-rule-specific Lua scripts [puppet] - 10https://gerrit.wikimedia.org/r/472971 (https://phabricator.wikimedia.org/T209021) [12:45:25] (03PS1) 10Ema: ATS: set X-MediaWiki-Original for upload.w.o [puppet] - 10https://gerrit.wikimedia.org/r/472972 (https://phabricator.wikimedia.org/T209021) [12:45:59] (03CR) 10jerkins-bot: [V: 04-1] ATS: allow specifying remap-rule-specific Lua scripts [puppet] - 10https://gerrit.wikimedia.org/r/472971 (https://phabricator.wikimedia.org/T209021) (owner: 10Ema) [12:46:15] (03CR) 10jerkins-bot: [V: 04-1] ATS: set X-MediaWiki-Original for upload.w.o [puppet] - 10https://gerrit.wikimedia.org/r/472972 (https://phabricator.wikimedia.org/T209021) (owner: 10Ema) [12:47:08] phuedx: looks like no jobs are failing now, for now [12:47:21] 10Operations, 10Wikimedia-Mailing-lists: Requesting creation of librarycard-dev mailing list - https://phabricator.wikimedia.org/T209081 (10ArielGlenn) Am I right in assuming this list's archives would be public? [12:47:22] I'll merge and deploy one throttle change while waiting for this one [12:48:19] (03PS2) 10Ema: ATS: allow specifying remap-rule-specific Lua scripts [puppet] - 10https://gerrit.wikimedia.org/r/472971 (https://phabricator.wikimedia.org/T209021) [12:48:31] (03CR) 10Zfilipin: "Not deployed during swat today because the developer was not around." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/471534 (https://phabricator.wikimedia.org/T208663) (owner: 10Zoranzoki21) [12:48:58] (03PS2) 10Effie Mouzeli: install_server: reimage rdb2001, rdb2002 to stretch [puppet] - 10https://gerrit.wikimedia.org/r/472970 (https://phabricator.wikimedia.org/T206450) [12:50:02] (03PS2) 10Ema: ATS: set X-MediaWiki-Original for upload.w.o [puppet] - 10https://gerrit.wikimedia.org/r/472972 (https://phabricator.wikimedia.org/T209021) [12:50:07] (03PS4) 10Zfilipin: Add new throttle rule for Wikipedia event in Ireland on 2018-11-13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/472691 (https://phabricator.wikimedia.org/T209037) (owner: 10Zoranzoki21) [12:50:20] Zfilipin: I am here [12:50:30] Zfilipin: I wrote you reason in PM for lating [12:50:51] 10Operations, 10Puppet, 10Proposal, 10cloud-services-team (Kanban): Revisit and update python testing in puppet - https://phabricator.wikimedia.org/T209189 (10Volans) Thanks @Bstorm for formalizing our random IRC chat into this proposal 😉 I would add that it would be surreal to think to be able to have c... [12:50:56] (03CR) 10jerkins-bot: [V: 04-1] ATS: set X-MediaWiki-Original for upload.w.o [puppet] - 10https://gerrit.wikimedia.org/r/472972 (https://phabricator.wikimedia.org/T209021) (owner: 10Ema) [12:51:13] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/472691 (https://phabricator.wikimedia.org/T209037) (owner: 10Zoranzoki21) [12:52:07] zeljkof: unrelated failure in the flow extension. is it reasonable to c+2 v+2 [12:52:40] (03Merged) 10jenkins-bot: Add new throttle rule for Wikipedia event in Ireland on 2018-11-13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/472691 (https://phabricator.wikimedia.org/T209037) (owner: 10Zoranzoki21) [12:52:51] phuedx: argh, again something? 
let me see [12:53:07] zeljkof: not an npm issue here but a test failure in another extension [12:53:13] (not core, not a maintenance script) [12:54:04] (03PS2) 10Giuseppe Lavagetto: mediawiki: prune unused files and templates [puppet] - 10https://gerrit.wikimedia.org/r/472126 [12:54:49] !log zfilipin@deploy1001 Synchronized wmf-config/throttle.php: SWAT: [[gerrit:472691|Add new throttle rule for Wikipedia event in Ireland on 2018-11-13 (T209037)]] (duration: 00m 53s) [12:54:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:54:52] Zoranzoki21: 472691 is deployed [12:54:52] T209037: Throttle exemption for Wikipedia event in Ireland on 2018-11-13 - https://phabricator.wikimedia.org/T209037 [12:55:19] zeljkof: Ok [12:55:39] (03CR) 10Filippo Giunchedi: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/472667 (owner: 10Filippo Giunchedi) [12:56:04] (03CR) 10Ema: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/472972 (https://phabricator.wikimedia.org/T209021) (owner: 10Ema) [12:56:17] godog: heh [12:56:51] (03CR) 10jerkins-bot: [V: 04-1] ATS: set X-MediaWiki-Original for upload.w.o [puppet] - 10https://gerrit.wikimedia.org/r/472972 (https://phabricator.wikimedia.org/T209021) (owner: 10Ema) [12:57:05] godog: your recheck was luckier than mine! [12:57:23] phuedx: I'm not familiar with your patch or the test that failed :( [12:57:46] ema: lol, me and hashar had to do some work before that could happen tho! [12:58:04] I would hesitate to merge the patch with a failed test :/ is it urgent? or can it wait until hashar or somebody can take a look? 
[12:58:11] (03CR) 10Giuseppe Lavagetto: [C: 032] mediawiki: prune unused files and templates [puppet] - 10https://gerrit.wikimedia.org/r/472126 (owner: 10Giuseppe Lavagetto) [12:58:17] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): Relabel labvirt1017.eqiad.wmnet as cloudvirt1017.eqiad.wmnet - https://phabricator.wikimedia.org/T208945 (10aborrero) [12:59:16] (03CR) 10Filippo Giunchedi: [C: 032] "Fixed with latest Docker image" [puppet] - 10https://gerrit.wikimedia.org/r/472666 (owner: 10Filippo Giunchedi) [12:59:24] (03PS4) 10Filippo Giunchedi: mtail: more verbose test output on failure [puppet] - 10https://gerrit.wikimedia.org/r/472666 [12:59:38] <_joe_> godog: what was wrong? [13:00:05] zeljkof: it is urgent. i understand the hesitation. i just don't see a maintenance script affecting a test like that [13:00:09] (03CR) 10Filippo Giunchedi: [V: 032 C: 032] "Note jenkins will fail on this change, fixed in If557fcde61" [puppet] - 10https://gerrit.wikimedia.org/r/472666 (owner: 10Filippo Giunchedi) [13:00:34] (03PS3) 10Ema: ATS: allow specifying remap-rule-specific Lua scripts [puppet] - 10https://gerrit.wikimedia.org/r/472971 (https://phabricator.wikimedia.org/T209021) [13:00:36] (03PS3) 10Ema: ATS: set X-MediaWiki-Original for upload.w.o [puppet] - 10https://gerrit.wikimedia.org/r/472972 (https://phabricator.wikimedia.org/T209021) [13:00:39] phuedx: ok, feel free to manually merge it (remove jenkins's votes) and deploy [13:00:42] (03CR) 10Filippo Giunchedi: [C: 032] mtail: fix kernel.mtail compilation [puppet] - 10https://gerrit.wikimedia.org/r/472667 (owner: 10Filippo Giunchedi) [13:00:49] (03PS4) 10Filippo Giunchedi: mtail: fix kernel.mtail compilation [puppet] - 10https://gerrit.wikimedia.org/r/472667 [13:00:56] _joe_: wrong version of mtail [13:01:07] <_joe_> oh ok simple as that [13:01:31] yeah simple enough [13:01:41] (03CR) 10jerkins-bot: [V: 04-1] ATS: set X-MediaWiki-Original for upload.w.o [puppet] - 
10https://gerrit.wikimedia.org/r/472972 (https://phabricator.wikimedia.org/T209021) (owner: 10Ema) [13:01:49] 10-15 minutes is such a big feedback loop when your window is 1 hour [13:02:19] (03CR) 10Filippo Giunchedi: [C: 04-2] "mtail tests now are fixed in production, this change isn't needed anymore" [puppet] - 10https://gerrit.wikimedia.org/r/472200 (owner: 10Bstorm) [13:02:49] (03CR) 10Filippo Giunchedi: [C: 04-2] "Fixed in If557fcde6 and Ifba6a5c4" [puppet] - 10https://gerrit.wikimedia.org/r/472200 (owner: 10Bstorm) [13:03:30] (03PS4) 10Ema: ATS: allow specifying remap-rule-specific Lua scripts [puppet] - 10https://gerrit.wikimedia.org/r/472971 (https://phabricator.wikimedia.org/T209021) [13:03:32] (03PS4) 10Ema: ATS: set X-MediaWiki-Original for upload.w.o [puppet] - 10https://gerrit.wikimedia.org/r/472972 (https://phabricator.wikimedia.org/T209021) [13:04:02] zeljkof: i'm going to merge it, get it on mwdebug1002, then i'll sync the autoload.php and then the maintenance script file [13:04:10] phuedx: ok [13:04:40] zeljkof: should i manually submit the change too? [13:04:57] 10Operations, 10Cloud-VPS, 10netops, 10cloud-services-team (Kanban): dmz_cidr only includes some wikimedia public IP ranges, leading to some very strange behaviour - https://phabricator.wikimedia.org/T174596 (10aborrero) To my understanding, there are 3 ways an VM do egress traffic (https://wikitech.wikime... 
[13:05:01] (also, the zuul status monitor appears to be empty :/) [13:05:38] phuedx: I think you have to manually submit, if jenkins didn't do it [13:05:43] (since a job failed) [13:05:53] (03CR) 10jenkins-bot: Add new throttle rule for Wikipedia event in Ireland on 2018-11-13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/472691 (https://phabricator.wikimedia.org/T209037) (owner: 10Zoranzoki21) [13:05:56] (03PS2) 10Giuseppe Lavagetto: php::extension: use version-specific package name by default [puppet] - 10https://gerrit.wikimedia.org/r/470863 (https://phabricator.wikimedia.org/T208433) [13:06:53] (03CR) 10GTirloni: hieradata: add a note about namespaced keys (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/472952 (https://phabricator.wikimedia.org/T209265) (owner: 10Filippo Giunchedi) [13:07:26] change is on mwdebug1002 [13:10:13] ok. browsed to a couple of large wikipedias while pinned to mwdebug1002 and everything looked good (this is, after all, a script that shouldn't be touched) [13:12:09] syncing autoload.php and then the maintenance script [13:12:10] !log upgrade the Hadoop Analytics cluster to CDH 5.15 (downtime required) [13:12:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:13:14] (03PS3) 10Giuseppe Lavagetto: php::extension: use version-specific package name by default [puppet] - 10https://gerrit.wikimedia.org/r/470863 (https://phabricator.wikimedia.org/T208433) [13:14:45] 10Operations, 10Cloud-VPS, 10netops, 10cloud-services-team (Kanban): dmz_cidr only includes some wikimedia public IP ranges, leading to some very strange behaviour - https://phabricator.wikimedia.org/T174596 (10aborrero) [13:14:57] !log phuedx@deploy1001 Synchronized php-1.33.0-wmf.3/autoload.php: SWAT: [[gerrit:472958|Provide a script to reset the page_random column (T208909)]] (duration: 00m 55s) [13:14:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:00] T208909: [Bug] Update old nonuniformly 
distributed page_random values - https://phabricator.wikimedia.org/T208909 [13:15:13] phuedx: finished with the deploy? [13:15:30] zeljkof: syncing the maintenance script then i'm done [13:15:32] one sec [13:15:34] and thanks [13:16:30] !log updating liblognorm on stretch to 2.0.3-1~bpo9+1wmf1 [13:16:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:16:45] !log phuedx@deploy1001 Synchronized php-1.33.0-wmf.3/maintenance/resetPageRandom.php: SWAT: [[gerrit:472958|Provide a script to reset the page_random column (T208909)]] (duration: 00m 53s) [13:16:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:16:52] zeljkof: ^^ done [13:17:12] phuedx: great! sorry about the CI trouble :/ [13:17:18] !log EU SWAT finished [13:17:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:17:31] zeljkof: not your fault. sorry for being a little pushy. this is blocking a large chunk of work for us [13:18:11] * phuedx keeps an eye on fatalmonitor [13:18:23] 10Operations, 10Cloud-VPS, 10netops, 10cloud-services-team (Kanban): dmz_cidr only includes some wikimedia public IP ranges, leading to some very strange behaviour - https://phabricator.wikimedia.org/T174596 (10aborrero) [13:19:08] (03CR) 10Filippo Giunchedi: hieradata: add a note about namespaced keys (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/472952 (https://phabricator.wikimedia.org/T209265) (owner: 10Filippo Giunchedi) [13:20:02] 10Operations, 10Cloud-VPS, 10netops, 10cloud-services-team (Kanban): dmz_cidr only includes some wikimedia public IP ranges, leading to some very strange behaviour - https://phabricator.wikimedia.org/T174596 (10aborrero) I guess there was a reason for these DC-wide bypasses. Before I start doing git/phab a... 
[13:23:11] !log phuedx@mwmaint1002 running resetPageRandom.php maintenance script for testwiki [13:23:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:29:16] (03PS5) 10Ema: ATS: set X-MediaWiki-Original for upload.w.o [puppet] - 10https://gerrit.wikimedia.org/r/472972 (https://phabricator.wikimedia.org/T209021) [13:29:18] (03PS1) 10Ema: ATS: add trafficserver::lua_infra [puppet] - 10https://gerrit.wikimedia.org/r/472984 [13:32:20] !log phuedx@mwmaint1002 running restPageRandom.php maintenance script for mediawikiwiki [13:32:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:33:31] (03CR) 10Ema: [C: 032] ATS: allow specifying remap-rule-specific Lua scripts [puppet] - 10https://gerrit.wikimedia.org/r/472971 (https://phabricator.wikimedia.org/T209021) (owner: 10Ema) [13:33:42] (03CR) 10Ema: [C: 032] ATS: add trafficserver::lua_infra [puppet] - 10https://gerrit.wikimedia.org/r/472984 (owner: 10Ema) [13:33:51] (03CR) 10Ema: [C: 032] ATS: set X-MediaWiki-Original for upload.w.o [puppet] - 10https://gerrit.wikimedia.org/r/472972 (https://phabricator.wikimedia.org/T209021) (owner: 10Ema) [13:40:49] (03PS1) 10Filippo Giunchedi: swift: add statsd_port parameter [puppet] - 10https://gerrit.wikimedia.org/r/472986 (https://phabricator.wikimedia.org/T205870) [13:41:04] !log starting rolling restart of elasticsearch codfw for JVM upgrade [13:41:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:26] (03CR) 10jerkins-bot: [V: 04-1] swift: add statsd_port parameter [puppet] - 10https://gerrit.wikimedia.org/r/472986 (https://phabricator.wikimedia.org/T205870) (owner: 10Filippo Giunchedi) [13:41:44] ;_; [13:45:46] (03PS2) 10Filippo Giunchedi: swift: add statsd_port parameter [puppet] - 10https://gerrit.wikimedia.org/r/472986 (https://phabricator.wikimedia.org/T205870) [13:46:45] 10Operations, 10Traffic, 10Patch-For-Review: ATS backend-side request-mangling - 
https://phabricator.wikimedia.org/T209021 (10ema) X-MediaWiki-Original is now being set by ATS with a remap rule specific to upload.wikimedia.org: $ curl -sIH "Host: upload.wikimedia.org" http://cp1072.eqiad.wmnet:3129/wikipe... [13:46:56] !log updating libfastjson on stretch to 0.99.8-1~bpo9+1wmf1 [13:46:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:49:11] (03CR) 10Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/compiler1002/13439/" [puppet] - 10https://gerrit.wikimedia.org/r/472986 (https://phabricator.wikimedia.org/T205870) (owner: 10Filippo Giunchedi) [13:50:15] (03CR) 10Filippo Giunchedi: [C: 032] swift: add statsd_port parameter [puppet] - 10https://gerrit.wikimedia.org/r/472986 (https://phabricator.wikimedia.org/T205870) (owner: 10Filippo Giunchedi) [13:50:23] (03PS3) 10Filippo Giunchedi: swift: add statsd_port parameter [puppet] - 10https://gerrit.wikimedia.org/r/472986 (https://phabricator.wikimedia.org/T205870) [13:52:08] (03CR) 10Muehlenhoff: "Yeah, agreed, an internal package (with a version different from bpo) seems best." [puppet] - 10https://gerrit.wikimedia.org/r/472694 (https://phabricator.wikimedia.org/T206454) (owner: 10Herron) [13:53:52] 10Operations: debmonitor search yields nothing - https://phabricator.wikimedia.org/T209279 (10hashar) [13:55:11] !log Upgrading Jenkins on contint1001 , contint2001, releases1001 and releases2002 | T209264 [13:55:12] 10Operations: debmonitor search yields nothing - https://phabricator.wikimedia.org/T209279 (10MoritzMuehlenhoff) Where did you search specifically? Does e.g. https://debmonitor.wikimedia.org/packages/jenkins list the installed packages for you? 
[13:55:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:14] T209264: Upgrade Jenkins instances to 2.138.3 - https://phabricator.wikimedia.org/T209264 [13:59:00] !log Deploy schema change on db1101:3318 - T203709 [13:59:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:59:07] T203709: Schema change for adding indexes of ct_tag_id - https://phabricator.wikimedia.org/T203709 [13:59:28] !log phuedx@mwmaint1002 running restPageRandom.php maintenance script for small wikis (small.dblist) [13:59:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:58] 10Operations: debmonitor search yields nothing - https://phabricator.wikimedia.org/T209279 (10Volans) I guess he's referring to the search bar at the top-right, pending code review since July ;) [14:01:09] 10Operations: debmonitor search yields nothing - https://phabricator.wikimedia.org/T209279 (10Volans) [14:02:03] 10Operations: debmonitor search yields nothing - https://phabricator.wikimedia.org/T209279 (10hashar) I went to https://debmonitor.wikimedia.org/ , there is a search box in the menu bar on the top right, I have simply filled search terms there (`jenkins` or `contint1001`) but no luck, I am simply redirected to h... [14:02:35] 10Operations: debmonitor search yields nothing - https://phabricator.wikimedia.org/T209279 (10hashar) Ha @Volans nailing it: the search is not implemented :) thanks! [14:04:06] 10Operations: debmonitor search yields nothing - https://phabricator.wikimedia.org/T209279 (10MoritzMuehlenhoff) @hashar: Search is implemented in general: If you click either of "Hosts", "Kernels", "Packages" or "Source Packages" you can search in there. 
[14:06:03] (03CR) 10DCausse: elasticsearch: cookbook for multi-cluster services rolling restart (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/467964 (https://phabricator.wikimedia.org/T207919) (owner: 10Mathew.onipe) [14:06:04] PROBLEM - Packet loss ratio for UDP on logstash1008 is CRITICAL: 0.3889 ge 0.1 https://grafana.wikimedia.org/dashboard/db/logstash [14:08:14] !log updating rsyslog on stretch to 8.38.0-1~bpo9+1wmf1 [14:08:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:35] RECOVERY - Packet loss ratio for UDP on logstash1008 is OK: (C)0.1 ge (W)0.05 ge 0.03463 https://grafana.wikimedia.org/dashboard/db/logstash [14:15:15] PROBLEM - DPKG on ping1001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [14:17:28] !log phuedx@mwmaint1002 running restPageRandom.php maintenance script for medium wikis [14:17:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:53] only just realised that i've been misspelling "reset" [14:17:56] one of those days [14:18:44] PROBLEM - puppet last run on wtp1036 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 5 minutes ago with 2 failures. 
Failed resources (up to 3 shown): Package[rsyslog],Package[rsyslog-gnutls] [14:23:14] RECOVERY - DPKG on ping1001 is OK: All packages OK [14:23:16] (03PS1) 10Filippo Giunchedi: thumbor: relay statsd_exporter metrics to localhost [puppet] - 10https://gerrit.wikimedia.org/r/472996 (https://phabricator.wikimedia.org/T205870) [14:23:45] RECOVERY - puppet last run on wtp1036 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [14:27:31] (03CR) 10Filippo Giunchedi: [C: 032] "PCC https://puppet-compiler.wmflabs.org/compiler1002/13440/thumbor2001.codfw.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/472996 (https://phabricator.wikimedia.org/T205870) (owner: 10Filippo Giunchedi) [14:27:39] (03PS2) 10Filippo Giunchedi: thumbor: relay statsd_exporter metrics to localhost [puppet] - 10https://gerrit.wikimedia.org/r/472996 (https://phabricator.wikimedia.org/T205870) [14:39:51] (03PS1) 10Filippo Giunchedi: swift: turn on statsd_exporter in codfw [puppet] - 10https://gerrit.wikimedia.org/r/473006 (https://phabricator.wikimedia.org/T205870) [14:42:17] (03PS2) 10Filippo Giunchedi: swift: turn on statsd_exporter in codfw [puppet] - 10https://gerrit.wikimedia.org/r/473006 (https://phabricator.wikimedia.org/T205870) [14:43:59] (03CR) 10GTirloni: hieradata: add a note about namespaced keys (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/472952 (https://phabricator.wikimedia.org/T209265) (owner: 10Filippo Giunchedi) [14:44:49] (03PS17) 10Banyek: mariadb: refactoring parsercache role to module & add pc2007 [puppet] - 10https://gerrit.wikimedia.org/r/470851 (https://phabricator.wikimedia.org/T208383) [14:46:45] PROBLEM - Check systemd state on notebook1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[14:46:53] (03CR) 10Marostegui: [C: 031] mariadb: refactoring parsercache role to module & add pc2007 [puppet] - 10https://gerrit.wikimedia.org/r/470851 (https://phabricator.wikimedia.org/T208383) (owner: 10Banyek) [14:47:24] PROBLEM - Check systemd state on notebook1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [14:48:33] (03CR) 10Filippo Giunchedi: [C: 032] "PCC https://puppet-compiler.wmflabs.org/compiler1002/13443/" [puppet] - 10https://gerrit.wikimedia.org/r/473006 (https://phabricator.wikimedia.org/T205870) (owner: 10Filippo Giunchedi) [14:48:41] (03PS3) 10Filippo Giunchedi: swift: turn on statsd_exporter in codfw [puppet] - 10https://gerrit.wikimedia.org/r/473006 (https://phabricator.wikimedia.org/T205870) [14:49:04] RECOVERY - Check systemd state on notebook1003 is OK: OK - running: The system is fully operational [14:49:35] RECOVERY - Check systemd state on notebook1004 is OK: OK - running: The system is fully operational [14:49:41] !log disabling puppet on parsercache hosts - pc[12]00[456] (T208383) [14:49:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:44] T208383: Implement parsercache service on pc[12]0(07|08|09|10) and replace leased pc[12]00[456] - https://phabricator.wikimedia.org/T208383 [14:53:10] (03CR) 10Banyek: [C: 032] mariadb: refactoring parsercache role to module & add pc2007 [puppet] - 10https://gerrit.wikimedia.org/r/470851 (https://phabricator.wikimedia.org/T208383) (owner: 10Banyek) [14:53:33] (03PS18) 10Banyek: mariadb: refactoring parsercache role to module & add pc2007 [puppet] - 10https://gerrit.wikimedia.org/r/470851 (https://phabricator.wikimedia.org/T208383) [14:53:36] (03CR) 10Banyek: [V: 032 C: 032] mariadb: refactoring parsercache role to module & add pc2007 [puppet] - 10https://gerrit.wikimedia.org/r/470851 (https://phabricator.wikimedia.org/T208383) (owner: 10Banyek) [14:53:54] _joe_: is there any plan for a generic static html serving 
nginx docker image for the wmf? [14:54:24] <_joe_> addshore: I never thought we had a use-case for that [14:54:28] I'm working on https://phabricator.wikimedia.org/T192006 moving the wdqs UI to the pipeline, and we don't actually need to run the service from a node image, we just need nginx [14:55:25] I mean, I could make it use some node based web thing to serve the content instead, but perhaps an nginx image would make more sense? [14:58:45] PROBLEM - puppet last run on analytics1039 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/var/lib/hadoop/data/d/yarn] [14:59:51] hm! [15:00:36] _joe_: i guess if I were to use something like https://www.npmjs.com/package/http-server i would have to get a security review for it (perhaps?) vs just using nginx [15:01:23] <_joe_> let's just use nginx please [15:01:41] makes much more sense [15:01:43] imho [15:01:58] _joe_: yup, right, let me file a ticket about an nginx image then and look into it [15:02:08] <_joe_> addshore: so the UI is just statick html? [15:02:14] yup [15:02:23] <_joe_> addshore: ok so you don't need a new ticket [15:02:48] there is one? :) [15:02:48] <_joe_> I might ask why do you want to move a bunch of static html to kubernetes though [15:03:21] <_joe_> so yeah, can I be included in the parent task? Because I'm not sure it's a wise choice [15:03:52] we want the build process of the build pipeline basically [15:04:10] <_joe_> for static html? 
[15:04:20] and we would like to be able to give external people using the query service ui the same docker image / setup that we use [15:04:21] <_joe_> again, please add me to the ticket so I can understand :) [15:04:41] yup, its linked above, https://phabricator.wikimedia.org/T192006, I'll cc you [15:06:03] we already have and maintain a docker image for the query service UI, and we want to continue doing that, so having the same image be used across the board for it makes sense [15:06:16] (03PS1) 10Banyek: mariadb: silence pc2007 [puppet] - 10https://gerrit.wikimedia.org/r/473021 (https://phabricator.wikimedia.org/T208383) [15:06:33] the ticket could probably do with some cleaning up as it has evolved over the year [15:08:11] (03CR) 10Jcrespo: [C: 031] mariadb: silence pc2007 [puppet] - 10https://gerrit.wikimedia.org/r/473021 (https://phabricator.wikimedia.org/T208383) (owner: 10Banyek) [15:08:41] (03CR) 10Banyek: [C: 032] mariadb: silence pc2007 [puppet] - 10https://gerrit.wikimedia.org/r/473021 (https://phabricator.wikimedia.org/T208383) (owner: 10Banyek) [15:09:41] <_joe_> addshore: except deploying static files in kubernetes is a bit of a waste of resources [15:11:12] well, the pie cuts many ways, maintaining 2 different deployment methods for the thing is also a bit of a waste of resources, albeit different resources [15:12:38] 10Puppet, 10Patch-For-Review: Validate no namespaced keys are present in hieradata/*.yaml - https://phabricator.wikimedia.org/T209265 (10fgiunchedi) I've added a note to those files as a bandaid, though these key/value from `common.yaml` will need to be checked: ` profile::openstack::main::cumin::auth_group:... 
[15:14:51] <_joe_> addshore: well for what pertains production we're not really talking about two methods [15:15:07] <_joe_> addshore: it's like saying we should deploy mediawiki from tarballs, more or less [15:15:42] <_joe_> and well, docker images *literally* are tarballs [15:16:07] Where are the resources being wasted anyway? [15:16:51] <_joe_> running a specialized nginx instance (multiple copies) in kubernetes for serving a bunch of static files we could easily host on the wdqs servers? [15:17:02] <_joe_> it's a waste of resources for sure. [15:17:16] 10Operations, 10Beta-Cluster-Infrastructure, 10DNS, 10Traffic, and 4 others: Ferm's upstream Net::DNS Perl library questionable handling of NOERROR responses without records causing puppet errors when we try to @resolve AAAA in labs - https://phabricator.wikimedia.org/T153468 (10MoritzMuehlenhoff) Indeed,... [15:17:35] (multiple copies), mulitple copies is something that is enforced? or? [15:17:38] <_joe_> and more in general, we have servers from which we serve mostly static files [15:18:14] <_joe_> addshore: well if you want to obtain availability, typically, yes [15:18:30] moritzm, nice! [15:18:47] <_joe_> all the properties for which is desirable to run a service on kubernetes are just lacking when it comes to serving some static content via nginx [15:18:54] that libnet-dns-perl release was only an RC [15:18:59] _joe_: well, we could end up in the situation where we have the built images using blubber and just don't deploy using them then? [15:19:20] <_joe_> oh, sure, if that suits you :) [15:19:21] Krenair: yeah, I'll file a bug to get this updated in Debian once the final 1.19 is out [15:19:33] but we want to migrate the image build process to the pipeline (even if we miss out the deploy bit) [15:19:35] moritzm, cool. does debian just update for things like this? [15:19:45] or are we talking about for future debian releases? 
[15:19:56] which returns us to the previous question of an nginx image :) [15:20:04] <_joe_> addshore: ok, so you would need an nginx image [15:20:07] * addshore is happy to go and write it / try to write it [15:20:40] Krenair: not sure, this might be below the bar for a stretch stable update, but I'll see what I can do [15:20:45] ok [15:20:48] <_joe_> well if we have to build one, I'd like to get some input from our nginx experts first, and then we need to reason on how to make it configurable [15:20:53] in any case I'll make sure this lands in buster [15:21:07] _joe_: okay, should I file a ticket? :) [15:21:08] <_joe_> because well, an image for just serving static content by default? no TLS? [15:21:12] moritzm, great. out of interest were you guys seeing this in prod too? [15:21:20] <_joe_> addshore: ok [15:22:23] Krenair: kind of, it has been worked around (e.g. the ferm rules for some of the dumps servers have separate lists for mirrors which don't have an AAAA record) and godog also ran into it for prometheus as labmon doesn't have an AAAA either [15:22:27] 10Puppet, 10Patch-For-Review: Validate no namespaced keys are present in hieradata/*.yaml - https://phabricator.wikimedia.org/T209265 (10Volans) Regarding the few that I know: - `profile::openstack::main::cumin::auth_group: cumin_masters` doesn't actually seems to be defined elsewhere, it should probably be mo... [15:22:35] but this will allow to clean this up [15:22:38] interesting [15:22:44] okay [15:22:50] well thanks for taking care of the packaging stuff [15:24:25] thanks for reporting upstream! [15:24:56] for reference, the dumps stuff I mentioned is in modules/profile/manifests/dumps/distribution/ferm.pp [15:26:36] _joe_: what would be the correct phab project? [15:27:03] <_joe_> ha! 
good q [15:27:15] <_joe_> tag releng :P [15:27:39] <_joe_> or better, ask them [15:28:05] I'll ask thcipriani when he wakes up [15:29:15] <_joe_> because it's asking for a new docker image in production-images, so that's operations, but also some other things, so I'd look there [15:29:36] <_joe_> is there a tag for the pipeline? one for blubber? [15:29:46] <_joe_> which ones are appropriate? I dunno :) [15:29:53] <_joe_> oh thcipriani is prolly off today [15:30:11] there is a "Release Pipeline" tsk [15:30:20] *tag [15:30:27] <_joe_> thcipriani: isn't today a holiday for you? [15:30:37] ish [15:31:03] but I was just making coffee and saw a ping :) [15:32:42] hehe [15:33:00] I can do release pipeline and blubber? :) [15:35:32] 10Operations, 10Wikibase-Containers, 10Wikidata, 10wikidata-tech-focus, 10Release Pipeline (Blubber): Create a wmf production ready nginx image - https://phabricator.wikimedia.org/T209292 (10Addshore) p:05Triage>03Normal [15:35:39] We've been using the blubber tag for work on Blubber itself. [15:36:44] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1101:3318" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473028 [15:36:49] although I noticed you tagged some current work blubber and it made sense so we might need to add some columns on that workboard to keep it organized. 
[15:36:50] 10Operations, 10SRE-Access-Requests, 10netops, 10Patch-For-Review: Access to network devices for Riccardo (volans) - https://phabricator.wikimedia.org/T208726 (10ArielGlenn) I hear ya, just looking to get it off our clinic duty dashboard :-) [15:37:44] (03CR) 10Cwhite: "> The uid in ldap for the account wmde-leszek is 12300; can you say" [puppet] - 10https://gerrit.wikimedia.org/r/472052 (https://phabricator.wikimedia.org/T208717) (owner: 10Cwhite) [15:39:19] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1101:3318" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473028 (owner: 10Marostegui) [15:40:27] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1101:3318" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473028 (owner: 10Marostegui) [15:40:38] 10Operations, 10SRE-Access-Requests, 10WMDE-Analytics-Engineering, 10Graphite, 10User-Addshore: Requesting access to graphite hosts for addshore - https://phabricator.wikimedia.org/T208750 (10ArielGlenn) >>! In T208750#4739015, @Addshore wrote: ... > > I can easily get WMDE manager sign off, I'm not rea... [15:42:09] 10Operations, 10SRE-Access-Requests, 10WMDE-Analytics-Engineering, 10Graphite, 10User-Addshore: Requesting access to graphite hosts for addshore - https://phabricator.wikimedia.org/T208750 (10Addshore) >>! In T208750#4739752, @ArielGlenn wrote: >> I can easily get WMDE manager sign off, I'm not really su... [15:43:26] 10Operations, 10SRE-Access-Requests, 10WMDE-Analytics-Engineering, 10Graphite, 10User-Addshore: Requesting access to graphite hosts for addshore - https://phabricator.wikimedia.org/T208750 (10ArielGlenn) >>! In T208750#4739774, @Addshore wrote: >>>! In T208750#4739752, @ArielGlenn wrote: >> Hrm, good que... 
[15:43:47] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1101:3318 (duration: 00m 53s) [15:43:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:45:24] 10Operations, 10Analytics, 10Analytics-Kanban, 10netops: Figure out networking details for new cloud-analytics-eqiad Hadoop/Presto cluster - https://phabricator.wikimedia.org/T207321 (10Ottomata) I'd prefer if we used 'analytics' instead of 'data lake'. Can we do cloudvirtanXXXX? cloudvirt-anXXXX? [15:45:49] !log stop and upgrade db2094 [15:45:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:46:27] (03CR) 10ArielGlenn: "Woops." [puppet] - 10https://gerrit.wikimedia.org/r/472052 (https://phabricator.wikimedia.org/T208717) (owner: 10Cwhite) [15:47:01] 10Operations, 10Release Pipeline, 10Release-Engineering-Team: Design pipeline image versioning scheme - https://phabricator.wikimedia.org/T209088 (10LarsWirzenius) I'd find the timestamp part of the tag much easier to read if it used a delimiter between date and time: mediawiki-services-zotero:20181019-1652... [15:47:42] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1101:3318" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473028 (owner: 10Marostegui) [15:47:48] (03CR) 10Cwhite: "> Woops." [puppet] - 10https://gerrit.wikimedia.org/r/472052 (https://phabricator.wikimedia.org/T208717) (owner: 10Cwhite) [15:48:01] PROBLEM - etcd request latencies on argon is CRITICAL: instance=10.64.32.133:6443 operation={compareAndSwap,get,list} https://grafana.wikimedia.org/dashboard/db/kubernetes-api [15:48:15] _joe_: thcipriani i actually made the assumption that blubber / build pipeline users that are not being deployed in wmf production need to use wmf images, is that actually the case? 
[15:49:11] PROBLEM - Request latencies on chlorine is CRITICAL: instance=10.64.0.45:6443 verb=PUT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [15:49:26] <_joe_> addshore: indeed it is [15:50:40] PROBLEM - proton endpoints health on proton1002 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received [15:51:20] PROBLEM - etcd request latencies on chlorine is CRITICAL: instance=10.64.0.45:6443 operation={get,list} https://grafana.wikimedia.org/dashboard/db/kubernetes-api [15:51:30] PROBLEM - Request latencies on argon is CRITICAL: instance=10.64.32.133:6443 verb=PATCH https://grafana.wikimedia.org/dashboard/db/kubernetes-api [15:52:50] PROBLEM - proton endpoints health on proton1002 is CRITICAL: Return code of 255 is out of bounds [15:54:21] PROBLEM - configured eth on proton1002 is CRITICAL: Return code of 255 is out of bounds [15:54:21] PROBLEM - DPKG on proton1002 is CRITICAL: Return code of 255 is out of bounds [15:54:31] PROBLEM - Check whether ferm is active by checking the default input chain on proton1002 is CRITICAL: Return code of 255 is out of bounds [15:54:50] PROBLEM - Check systemd state on proton1002 is CRITICAL: Return code of 255 is out of bounds [15:54:50] RECOVERY - etcd request latencies on argon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [15:54:50] RECOVERY - Request latencies on argon is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/kubernetes-api [15:55:11] PROBLEM - puppet last run on proton1002 is CRITICAL: Return code of 255 is out of bounds [15:55:20] PROBLEM - dhclient process on proton1002 is CRITICAL: Return code of 255 is out of bounds [15:55:20] PROBLEM - Check size of conntrack table on proton1002 is CRITICAL: Return code of 255 is out of bounds [15:55:21] PROBLEM - Disk space on proton1002 is CRITICAL: Return code of 255 is out of bounds [15:55:50] RECOVERY - etcd request latencies on chlorine is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [15:55:59] what's up with proton? [15:56:00] RECOVERY - Request latencies on chlorine is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [15:56:43] OOM [15:57:29] * volans forcing a puppet run [15:57:40] does it look like it's the specific pdf being rendered, or...? [15:58:17] so almost OOM, the oom-killer didn't actually kill the memory-consuming procs [15:58:49] I guess it should be restarted but I have zero knowledge on this system, so waiting a moment if there is anyone knowing it around that wants to debug first [15:59:40] I don't either but I'll see if wikitech has anything to say [15:59:57] free RAM is close to zero, nrpe failed (hence why the alarms) and puppet cannot run [16:00:40] _joe_: do you know anything about proton by any chance? [16:00:51] wow it has zero to say. ugh [16:00:57] wikitech seems empty to me [16:01:04] <_joe_> volans: more or less, yes [16:01:13] <_joe_> it's supposed to be still experimental [16:01:24] proton1002 has close to zero available RAM, nrpe and puppet failing [16:01:33] OOM-killer didn't yet kill anything [16:01:37] <_joe_> volans: it's running headless chrome, so I wouldn't be surprised if it leaked memory [16:01:38] should we restart it? 
[16:01:44] <_joe_> volans: go ahead [16:02:14] !log restarted proton on proton1002 [16:02:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:02:56] running puppet now [16:02:57] (03CR) 10ArielGlenn: "Okay, yeah, we re-use the ldap uid these days. If you haven't done that before, instructions and an explanation are here: https://github.c" [puppet] - 10https://gerrit.wikimedia.org/r/472052 (https://phabricator.wikimedia.org/T208717) (owner: 10Cwhite) [16:03:00] RECOVERY - proton endpoints health on proton1002 is OK: All endpoints are healthy [16:03:10] RECOVERY - dhclient process on proton1002 is OK: PROCS OK: 0 processes with command name dhclient [16:03:11] RECOVERY - Check size of conntrack table on proton1002 is OK: OK: nf_conntrack is 0 % full [16:03:11] RECOVERY - Disk space on proton1002 is OK: DISK OK [16:03:30] RECOVERY - configured eth on proton1002 is OK: OK - interfaces up [16:03:30] RECOVERY - DPKG on proton1002 is OK: All packages OK [16:03:31] RECOVERY - Check whether ferm is active by checking the default input chain on proton1002 is OK: OK ferm input default policy is set [16:03:41] now we know [16:03:50] RECOVERY - Check systemd state on proton1002 is OK: OK - running: The system is fully operational [16:05:21] (03PS1) 10Effie Mouzeli: jobqueue_redis: Purge role jobqueue_redis [puppet] - 10https://gerrit.wikimedia.org/r/473029 (https://phabricator.wikimedia.org/T198220) [16:05:21] RECOVERY - puppet last run on proton1002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [16:09:14] !log phuedx@mwmaint1002 running restPageRandom.php maintenance script for large wikis [16:09:14] (03PS1) 10Banyek: mariadb: add pc2008, pc2009, pc2010 parsercaches [puppet] - 10https://gerrit.wikimedia.org/r/473032 (https://phabricator.wikimedia.org/T208383) [16:09:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:14:35] !log upgrade prometheus-mcrouter-exporter on all the mw* 
hosts to the new version [16:14:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:15:39] 10Operations, 10DBA: db2061 has predictive disk errors - https://phabricator.wikimedia.org/T208957 (10Marostegui) 05Open>03declined As spoken via IRC, let's wait for these disks to finally fail (we don't have spares anyways) and hosts with predictive errors are being tracked at {T208323} [16:16:19] 10Operations, 10DBA: Predictive failures on disk S.M.A.R.T. status - https://phabricator.wikimedia.org/T208323 (10Marostegui) [16:18:10] PROBLEM - Check systemd state on an-coord1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [16:18:35] checking --^ [16:21:21] PROBLEM - Check systemd state on notebook1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [16:23:40] RECOVERY - Check systemd state on notebook1004 is OK: OK - running: The system is fully operational [16:25:43] (03PS2) 10Ema: Expose Varnish X-Cache-Status via Server-Timing [puppet] - 10https://gerrit.wikimedia.org/r/472401 (https://phabricator.wikimedia.org/T207862) (owner: 10Gilles) [16:26:22] (03CR) 10Ema: [C: 032] Expose Varnish X-Cache-Status via Server-Timing [puppet] - 10https://gerrit.wikimedia.org/r/472401 (https://phabricator.wikimedia.org/T207862) (owner: 10Gilles) [16:32:08] (03CR) 10Marostegui: [C: 031] mariadb: add pc2008, pc2009, pc2010 parsercaches [puppet] - 10https://gerrit.wikimedia.org/r/473032 (https://phabricator.wikimedia.org/T208383) (owner: 10Banyek) [16:32:45] (03Abandoned) 10Herron: kafka_shipper: pin librdkafka1 to stretch-backports on stretch [puppet] - 10https://gerrit.wikimedia.org/r/472694 (https://phabricator.wikimedia.org/T206454) (owner: 10Herron) [16:39:33] (03CR) 10Banyek: "https://puppet-compiler.wmflabs.org/compiler1002/13446/" [puppet] - 10https://gerrit.wikimedia.org/r/473032 (https://phabricator.wikimedia.org/T208383) (owner: 10Banyek) [16:41:22] !log disabling 
puppet on parsercache hosts (T208383) [16:41:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:41:26] T208383: Implement parsercache service on pc[12]0(07|08|09|10) and replace leased pc[12]00[456] - https://phabricator.wikimedia.org/T208383 [16:44:01] (03CR) 10Banyek: [C: 032] mariadb: add pc2008, pc2009, pc2010 parsercaches [puppet] - 10https://gerrit.wikimedia.org/r/473032 (https://phabricator.wikimedia.org/T208383) (owner: 10Banyek) [16:44:10] (03PS2) 10Banyek: mariadb: add pc2008, pc2009, pc2010 parsercaches [puppet] - 10https://gerrit.wikimedia.org/r/473032 (https://phabricator.wikimedia.org/T208383) [16:44:14] (03CR) 10Banyek: [V: 032 C: 032] mariadb: add pc2008, pc2009, pc2010 parsercaches [puppet] - 10https://gerrit.wikimedia.org/r/473032 (https://phabricator.wikimedia.org/T208383) (owner: 10Banyek) [16:51:20] PROBLEM - puppet last run on pc2010 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 35 seconds ago with 1 failures. Failed resources (up to 3 shown): Exec[pt-heartbeat] [16:51:44] that's ok, it's me^ [16:52:10] PROBLEM - puppet last run on pc2008 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 1 minute ago with 1 failures. Failed resources (up to 3 shown): Exec[pt-heartbeat] [16:52:50] PROBLEM - puppet last run on pc2009 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[pt-heartbeat] [16:53:09] (03PS11) 10ArielGlenn: dump url shorteners for wiki projects [puppet] - 10https://gerrit.wikimedia.org/r/278400 (https://phabricator.wikimedia.org/T116986) [16:53:48] (03CR) 10ArielGlenn: "Tried to bring the changeset up to date, might need a couple more tweaks. As you say, 'just in case'." 
[puppet] - 10https://gerrit.wikimedia.org/r/278400 (https://phabricator.wikimedia.org/T116986) (owner: 10ArielGlenn) [16:55:41] PROBLEM - High CPU load on API appserver on mw1227 is CRITICAL: CRITICAL - load average: 61.86, 31.96, 18.21 [16:56:27] 10Operations, 10Research-Programs, 10SRE-Access-Requests, 10Epic: Server Access for 3 formal collaborators - https://phabricator.wikimedia.org/T209298 (10Miriam) p:05Triage>03High [16:57:11] PROBLEM - High CPU load on API appserver on mw1233 is CRITICAL: CRITICAL - load average: 49.90, 27.10, 15.87 [16:57:40] PROBLEM - High CPU load on API appserver on mw1222 is CRITICAL: CRITICAL - load average: 58.06, 33.37, 18.75 [16:57:51] PROBLEM - High CPU load on API appserver on mw1225 is CRITICAL: CRITICAL - load average: 48.45, 30.93, 18.05 [16:57:58] (03PS1) 10DCausse: Fix surrogate pair issues [software/elasticsearch/plugins] (5.x) - 10https://gerrit.wikimedia.org/r/473045 (https://phabricator.wikimedia.org/T209293) [16:58:12] <_joe_> oh oh [16:58:18] <_joe_> trouble in appserverland [16:59:56] 10Operations, 10Research-Programs, 10SRE-Access-Requests, 10Epic: Server Access for 3 formal collaborators - https://phabricator.wikimedia.org/T209298 (10Miriam) [17:00:19] _joe_: I see api dbs of s2 with 3x the normal load [17:00:28] maybe more [17:01:38] I also see a high number of writes on s1, but not sure if related [17:01:48] 10Operations, 10Wikimedia-Logstash, 10Core Platform Team Backlog (Watching / External), 10Patch-For-Review, and 3 others: Begin the implementation of Q1's Logging Infrastructure design (2018-19 Q2 Goal) - https://phabricator.wikimedia.org/T205849 (10herron) [17:01:55] <_joe_> s2 is what? 
[17:02:23] zhwiki and other 11 [17:02:26] https://grafana.wikimedia.org/dashboard/db/mysql-aggregated?orgId=1&var-dc=eqiad%20prometheus%2Fops&var-group=core&var-shard=s2&var-role=All&from=1542031328598&to=1542042128598 [17:03:07] _joe_: https://noc.wikimedia.org/db.php too long to paste [17:03:31] <_joe_> jynus: uhm, thanks [17:03:41] load still going up [17:04:03] <_joe_> https://grafana.wikimedia.org/dashboard/db/apache-hhvm?orgId=1&from=now-1h&to=now shows the rise in cpu usage AND network usage at the same time [17:05:20] PROBLEM - High CPU load on API appserver on mw1225 is CRITICAL: CRITICAL - load average: 45.69, 36.03, 24.74 [17:05:44] ACKNOWLEDGEMENT - MariaDB Slave IO: pc1 on pc2007 is CRITICAL: CRITICAL slave_io_state could not connect Banyek T208383 [17:05:45] ACKNOWLEDGEMENT - MariaDB Slave Lag: pc1 on pc2007 is CRITICAL: CRITICAL slave_sql_lag could not connect Banyek T208383 [17:05:45] ACKNOWLEDGEMENT - MariaDB Slave SQL: pc1 on pc2007 is CRITICAL: CRITICAL slave_sql_state could not connect Banyek T208383 [17:05:47] don't know if relevant but it seems that parsoid_batch latency is higher than last week https://grafana.wikimedia.org/dashboard/db/api-requests-breakdown?refresh=5m&orgId=1&from=now-1h&to=now&var-metric=p99&var-module=parsoid_batch [17:05:52] ACKNOWLEDGEMENT - mysqld processes on pc2007 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld Banyek T208383 [17:05:52] ACKNOWLEDGEMENT - puppet last run on pc2009 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 1 minute ago with 1 failures. Failed resources (up to 3 shown): Exec[pt-heartbeat] Banyek T208383 [17:05:55] 10Operations, 10cloud-services-team (Kanban): Reboot WMCS servers for L1TF - https://phabricator.wikimedia.org/T207377 (10aborrero) [17:07:36] _joe_: should I pool more api servers or will that make the issue worse? 
[17:07:47] ACKNOWLEDGEMENT - mysqld processes on pc2008 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld Banyek T208383 [17:08:17] <_joe_> jynus: wait a sec, but yes you should [17:10:28] (03PS1) 10Jcrespo: mariadb: Pool db1076 into s2-api databases [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473048 [17:10:37] <_joe_> !log depooling mw1222 for debug [17:10:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:11:25] ^ I have that patch ready [17:11:29] PROBLEM - High CPU load on API appserver on mw1231 is CRITICAL: CRITICAL - load average: 43.62, 35.69, 26.57 [17:11:42] hey [17:11:45] Can I help? [17:12:03] things are starting now to go down [17:12:16] but only as an inflection [17:12:51] (03Abandoned) 10Bstorm: mtail: attempt to make the tests not-broken [puppet] - 10https://gerrit.wikimedia.org/r/472200 (owner: 10Bstorm) [17:13:07] marostegui: we are having 120 KQPS in s2 when we normally have 40K [17:13:22] uff [17:13:26] and s1? [17:13:49] PROBLEM - MariaDB Slave IO: pc3 on pc2009 is CRITICAL: CRITICAL slave_io_state could not connect [17:14:13] https://grafana.wikimedia.org/dashboard/db/mysql-aggregated?panelId=8&fullscreen&orgId=1&from=1542036725272&to=1542042835272 [17:14:29] I don't know if that is related, happened much earlier [17:15:28] at the moment I am only worried about s2 api, which fits the app server issue [17:15:30] <_joe_> !log restarting HHVM on the high-cpu api hosts in eqiad, to ease the pressure and latencies [17:15:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:15:55] _joe_: any call on deploying or not? 
I will deploy if not [17:16:32] <_joe_> please do [17:16:43] doing [17:17:13] (03CR) 10Jcrespo: [C: 032] mariadb: Pool db1076 into s2-api databases [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473048 (owner: 10Jcrespo) [17:17:30] PROBLEM - MariaDB Slave SQL: pc3 on pc2009 is CRITICAL: CRITICAL slave_sql_state could not connect [17:17:34] _joe_: it is a feedback loop [17:17:39] (03CR) 10jenkins-bot: mariadb: Pool db1076 into s2-api databases [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473048 (owner: 10Jcrespo) [17:17:47] if they are slow, app servers get affected and the other way around [17:19:12] !log jynus@deploy1001 Synchronized wmf-config/db-eqiad.php: Pool more resources into s2 api (duration: 00m 54s) [17:19:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:19:59] RECOVERY - High CPU load on API appserver on mw1231 is OK: OK - load average: 11.95, 20.84, 23.98 [17:20:07] <_joe_> jynus: yes, btw right now the traffic ceased [17:20:11] is that the one you restarted? [17:20:43] _joe_: yes, I can see it on my side too [17:20:57] my deploy had 0 impact [17:21:12] well, not 0, but wasn't why it stopped [17:21:15] <_joe_> I restarted a few, but traffic just went back to its normal volumes [17:21:19] RECOVERY - High CPU load on API appserver on mw1225 is OK: OK - load average: 8.79, 17.45, 23.45 [17:21:31] yep, can see that also on qps [17:21:57] I will do a second deploy to leave things in the most resilient way possible even if this happens again [17:22:14] but it would be nice to know what was the case [17:22:18] *cause [17:22:22] 10Operations, 10Icinga, 10monitoring, 10Patch-For-Review: re-create script for manual paging - https://phabricator.wikimedia.org/T82937 (10Dzahn) Thanks for reporting @aborrero Yes, you have 2 contacts and we should switch you to AQL and i need to fix the script. I will get to that tomorrow (including the... 
[17:22:37] 10Operations, 10Icinga, 10monitoring, 10Patch-For-Review: re-create script for manual paging - https://phabricator.wikimedia.org/T82937 (10Dzahn) p:05Low>03High [17:23:17] PROBLEM - mysqld processes on pc2009 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [17:23:30] PROBLEM - MariaDB Slave Lag: pc3 on pc2009 is CRITICAL: CRITICAL slave_sql_lag could not connect [17:24:26] banyek: ^ [17:25:30] RECOVERY - High CPU load on API appserver on mw1222 is OK: OK - load average: 9.37, 13.46, 23.54 [17:25:30] 10Operations, 10Analytics, 10Wikimedia-Logstash, 10Core Platform Team Backlog (Watching / External), and 2 others: Review and make librdkafka-0.11.6 installable from stretch-wikimedia - https://phabricator.wikimedia.org/T209300 (10herron) p:05Triage>03Normal [17:26:37] there's so much cruft in kibana from parsoid it's hard to tell what's normal and what isn't >_< [17:27:27] I am trying to make a sane db config [17:27:32] 10Operations, 10Analytics, 10Wikimedia-Logstash, 10Core Platform Team Backlog (Watching / External), and 2 others: Review and make librdkafka-0.11.6 installable from stretch-wikimedia - https://phabricator.wikimedia.org/T209300 (10herron) @Ottomata and @elukey what do you think? 
[17:29:33] ACKNOWLEDGEMENT - MariaDB Slave IO: pc2 on pc2008 is CRITICAL: CRITICAL slave_io_state could not connect Banyek T208383 [17:29:33] ACKNOWLEDGEMENT - MariaDB Slave Lag: pc2 on pc2008 is CRITICAL: CRITICAL slave_sql_lag could not connect Banyek T208383 [17:29:33] ACKNOWLEDGEMENT - MariaDB Slave SQL: pc2 on pc2008 is CRITICAL: CRITICAL slave_sql_state could not connect Banyek T208383 [17:29:33] ACKNOWLEDGEMENT - MariaDB Slave IO: pc3 on pc2009 is CRITICAL: CRITICAL slave_io_state could not connect Banyek T208383 [17:29:33] ACKNOWLEDGEMENT - MariaDB Slave Lag: pc3 on pc2009 is CRITICAL: CRITICAL slave_sql_lag could not connect Banyek T208383 [17:29:33] ACKNOWLEDGEMENT - MariaDB Slave SQL: pc3 on pc2009 is CRITICAL: CRITICAL slave_sql_state could not connect Banyek T208383 [17:29:40] ACKNOWLEDGEMENT - mysqld processes on pc2009 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld Banyek T208383 [17:29:40] ACKNOWLEDGEMENT - MariaDB Slave IO: pc1 on pc2010 is CRITICAL: CRITICAL slave_io_state could not connect Banyek T208383 [17:29:40] ACKNOWLEDGEMENT - MariaDB Slave Lag: pc1 on pc2010 is CRITICAL: CRITICAL slave_sql_lag could not connect Banyek T208383 [17:29:40] ACKNOWLEDGEMENT - MariaDB Slave SQL: pc1 on pc2010 is CRITICAL: CRITICAL slave_sql_state could not connect Banyek T208383 [17:29:47] ACKNOWLEDGEMENT - mysqld processes on pc2010 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld Banyek T208383 [17:29:54] (03PS1) 10Ottomata: Add rsync module on thorium for stats.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/473050 [17:30:59] 10Operations, 10Analytics, 10Wikimedia-Logstash, 10Core Platform Team Backlog (Watching / External), and 2 others: Review and make librdkafka-0.11.6 installable from stretch-wikimedia - https://phabricator.wikimedia.org/T209300 (10Ottomata) I believe we had this problem (and discussion) before...and we de... 
[17:31:09] RECOVERY - High CPU load on API appserver on mw1233 is OK: OK - load average: 9.21, 12.02, 23.68 [17:35:10] RECOVERY - High CPU load on API appserver on mw1227 is OK: OK - load average: 9.88, 11.58, 23.04 [17:36:09] 10Operations, 10Cloud-VPS, 10netops, 10cloud-services-team (Kanban): dmz_cidr only includes some wikimedia public IP ranges, leading to some very strange behaviour - https://phabricator.wikimedia.org/T174596 (10aborrero) BTW, I'm focusing on the `eqiad1` deployment setting. Not paying much attention to the... [17:36:20] (03PS1) 10Jcrespo: mariadb: Optimize s2 for throughput, not for latency [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473051 [17:36:29] marostegui: ^ [17:36:38] I accept suggestions [17:37:26] let me check [17:37:51] anything I propose can backfire [17:37:56] 10Operations, 10Analytics, 10Wikimedia-Logstash, 10Core Platform Team Backlog (Watching / External), and 2 others: Review and make librdkafka-0.11.6 installable from stretch-wikimedia - https://phabricator.wikimedia.org/T209300 (10herron) >>! In T209300#4740111, @Ottomata wrote: > I believe we had this pro... [17:38:05] so I am not sure there is a "good option" here [17:38:33] yeah, that looks good [17:38:48] there is not much room to manoeuvre [17:38:55] I just want them to survive another spike like that [17:39:02] 10Operations, 10Cloud-VPS, 10netops, 10cloud-services-team (Kanban): dmz_cidr only includes some wikimedia public IP ranges, leading to some very strange behaviour - https://phabricator.wikimedia.org/T174596 (10aborrero) >>! In T174596#4740124, @aborrero wrote: > BTW, I'm focusing on the `eqiad1` deploymen... 
[17:39:39] marostegui: actually, my fault [17:39:42] I edited s1 [17:39:59] I thought that was intended :) [17:40:08] 10Operations, 10Analytics, 10Wikimedia-Logstash, 10Core Platform Team Backlog (Watching / External), and 2 others: Review and make librdkafka-0.11.6 installable from stretch-wikimedia - https://phabricator.wikimedia.org/T209300 (10Ottomata) I'm fairly certain there shouldn't be any stretch hosts using 0.9.3... [17:40:45] no, s1 is ok [17:40:47] (03PS2) 10Jcrespo: mariadb: Optimize s2 for throughput, not for latency [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473051 [17:40:52] s2 was the one with issues [17:41:04] (03CR) 10Marostegui: [C: 031] mariadb: Optimize s2 for throughput, not for latency [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473051 (owner: 10Jcrespo) [17:41:17] that should do [17:41:32] 10Operations, 10ops-eqiad: Heating alerts for mw servers in eqiad - https://phabricator.wikimedia.org/T149287 (10MoritzMuehlenhoff) @Cmjohnson : Per the procurement task, thermal paste is now available? [17:41:33] (03CR) 10Jcrespo: [C: 032] mariadb: Optimize s2 for throughput, not for latency [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473051 (owner: 10Jcrespo) [17:43:28] 10Operations, 10Analytics, 10Wikimedia-Logstash, 10Core Platform Team Backlog (Watching / External), and 2 others: Review and make librdkafka-0.11.6 installable from stretch-wikimedia - https://phabricator.wikimedia.org/T209300 (10MoritzMuehlenhoff) >>! In T209300#4740139, @Ottomata wrote: > I'm fairly cer... 
[17:44:20] !log jynus@deploy1001 Synchronized wmf-config/db-eqiad.php: Optimize s2 for throughput (duration: 00m 53s) [17:44:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:44:58] 10Operations, 10Analytics, 10Wikimedia-Logstash, 10Core Platform Team Backlog (Watching / External), and 2 others: Review and make librdkafka-0.11.6 installable from stretch-wikimedia - https://phabricator.wikimedia.org/T209300 (10MoritzMuehlenhoff) >>! In T209300#4740111, @Ottomata wrote: > I believe we h... [17:46:11] 10Operations, 10Analytics, 10Wikimedia-Logstash, 10Core Platform Team Backlog (Watching / External), and 2 others: Review and make librdkafka-0.11.6 installable from stretch-wikimedia - https://phabricator.wikimedia.org/T209300 (10Ottomata) Oh hm. There are no prod services running on the stat boxes. We... [17:46:58] (03CR) 10jenkins-bot: mariadb: Optimize s2 for throughput, not for latency [mediawiki-config] - 10https://gerrit.wikimedia.org/r/473051 (owner: 10Jcrespo) [17:48:02] 10Operations, 10Analytics, 10Wikimedia-Logstash, 10Core Platform Team Backlog (Watching / External), and 2 others: Review and make librdkafka-0.11.6 installable from stretch-wikimedia - https://phabricator.wikimedia.org/T209300 (10Ottomata) The only one there that we should check on for sure is wdqs1009.eq... [17:49:05] 10Operations, 10Analytics, 10Wikimedia-Logstash, 10Core Platform Team Backlog (Watching / External), and 2 others: Review and make librdkafka-0.11.6 installable from stretch-wikimedia - https://phabricator.wikimedia.org/T209300 (10MoritzMuehlenhoff) I think there are also some inconsistencies in the applic... 
[17:56:40] (03PS1) 10Elukey: hive-env.sh: add HIVE_SKIP_SPARK_ASSEMBLY=true [puppet/cdh] - 10https://gerrit.wikimedia.org/r/473053 [17:56:48] ottomata: --^ [18:00:00] RECOVERY - Check systemd state on an-coord1001 is OK: OK - running: The system is fully operational [18:00:04] gehel and onimisionipe: I, the Bot under the Fountain, allow thee, The Deployer, to do Wikidata Query Service weekly deploy deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181112T1800). [18:00:48] here! [18:03:14] !log rolling restart of aqs on aqs* to pick up new druid datasource settings [18:03:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:03:30] 10Operations, 10Analytics, 10Wikimedia-Logstash, 10Core Platform Team Backlog (Watching / External), and 2 others: Review and make librdkafka-0.11.6 installable from stretch-wikimedia - https://phabricator.wikimedia.org/T209300 (10Ottomata) > It has pros and cons: The downside of using backports is that it... 
[18:03:40] (PS4) Elukey: Add timer importing page-history dumps to hadoop [puppet] - https://gerrit.wikimedia.org/r/472472 (https://phabricator.wikimedia.org/T202489) (owner: Joal)
[18:04:38] (CR) Elukey: [C: 2] Add timer importing page-history dumps to hadoop [puppet] - https://gerrit.wikimedia.org/r/472472 (https://phabricator.wikimedia.org/T202489) (owner: Joal)
[18:10:05] (PS1) Milimetric: Add reportupdater job for language team metric [puppet] - https://gerrit.wikimedia.org/r/473056 (https://phabricator.wikimedia.org/T207765)
[18:13:03] (CR) Elukey: [C: 2] Add reportupdater job for language team metric [puppet] - https://gerrit.wikimedia.org/r/473056 (https://phabricator.wikimedia.org/T207765) (owner: Milimetric)
[18:17:47] !log onimisionipe@deploy1001 Started deploy [wdqs/wdqs@ee91c41]: GUI update, New Thesaurus endpoint, New updater build and blazegraph update
[18:17:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:24:19] Operations, ops-eqiad: Heating alerts for mw servers in eqiad - https://phabricator.wikimedia.org/T149287 (CDanis) We observed overheating symptoms on the following machines today: mw[1221-1227,1229,1231-1235,1238,1240-1248,1250-1251,1253,1255].eqiad.wmnet
[18:28:35] (CR) Joal: [C: 1] "LGTM !" [puppet/cdh] - https://gerrit.wikimedia.org/r/473053 (owner: Elukey)
[18:29:16] !log onimisionipe@deploy1001 Finished deploy [wdqs/wdqs@ee91c41]: GUI update, New Thesaurus endpoint, New updater build and blazegraph update (duration: 11m 28s)
[18:29:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:32:39] (CR) Ottomata: [C: 1] hive-env.sh: add HIVE_SKIP_SPARK_ASSEMBLY=true [puppet/cdh] - https://gerrit.wikimedia.org/r/473053 (owner: Elukey)
[18:37:25] (PS2) Giuseppe Lavagetto: profile::mediawiki::php: allow picking a php version [puppet] - https://gerrit.wikimedia.org/r/470865 (https://phabricator.wikimedia.org/T208433)
[18:37:27] (PS1) Giuseppe Lavagetto: mediawiki::php: install extensions with versioned package names [puppet] - https://gerrit.wikimedia.org/r/473058 (https://phabricator.wikimedia.org/T208433)
[18:52:48] (PS2) Ottomata: Add rsync module on thorium for stats.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/473050
[18:54:08] (CR) Ottomata: [C: 2] Add rsync module on thorium for stats.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/473050 (owner: Ottomata)
[19:00:04] Deploy window Morning SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181112T1900)
[19:00:04] Zoranzoki21: A patch you scheduled for Morning SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[19:00:15] ok :)
[19:01:19] Operations, ops-eqiad: Heating alerts for mw servers in eqiad - https://phabricator.wikimedia.org/T149287 (CDanis) Many of these machines are always running hot -- ambient temps of 85C or more, even when only lightly loaded. I whipped up a quick Grafana graph of some of the temperatures: https://grafana...
[19:02:01] Who will SWAT?
[19:03:41] It is a holiday
[19:04:10] If it is a holiday, why can we add patches for SWAT?
[19:04:22] https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181112T1900
[19:04:37] I just mean normal swatters might be on vacation today
[19:04:46] bawolff: Ok
[19:05:00] i guess i can do it if nobody is here
[19:05:55] Can you please?
[19:07:25] Ok, just a minute
[19:08:04] (PS4) Zoranzoki21: Enable moving files for users with patrol and rollbacker rights on srwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/471534 (https://phabricator.wikimedia.org/T208663)
[19:11:46] I assume it's cool if I step into swat's shoes. I'm not a SWAT team member
[19:12:18] bawolff_: You should :)
[19:12:26] ok
[19:14:11] (CR) Brian Wolff: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/471534 (https://phabricator.wikimedia.org/T208663) (owner: Zoranzoki21)
[19:15:19] (Merged) jenkins-bot: Enable moving files for users with patrol and rollbacker rights on srwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/471534 (https://phabricator.wikimedia.org/T208663) (owner: Zoranzoki21)
[19:15:34] (CR) jenkins-bot: Enable moving files for users with patrol and rollbacker rights on srwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/471534 (https://phabricator.wikimedia.org/T208663) (owner: Zoranzoki21)
[19:17:03] bawolff_: Can I test my change now?
[19:18:01] Zoranzoki21: should be live on mwdebug1002 now
[19:18:49] bawolff_: Let me check
[19:19:25] bawolff_: Works, move it to production
[19:19:29] ok
[19:21:59] !log bawolff@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT T208663 4ff32d1df - Enable moving files for users with patrol and rollbacker rights on srwiki (duration: 00m 54s)
[19:22:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:22:03] T208663: Enable "File mover" flag on sr.wikipedia - https://phabricator.wikimedia.org/T208663
[19:22:18] Zoranzoki21: ok, done
[19:22:33] bawolff_: Works in production, thanks!
[19:34:32] (CR) Elukey: [V: 2 C: 2] hive-env.sh: add HIVE_SKIP_SPARK_ASSEMBLY=true [puppet/cdh] - https://gerrit.wikimedia.org/r/473053 (owner: Elukey)
[19:37:18] (PS1) Elukey: Update cdh submodule to latest SHA [puppet] - https://gerrit.wikimedia.org/r/473071
[19:37:41] (PS3) GTirloni: toolforge: refactor mail server [puppet] - https://gerrit.wikimedia.org/r/471730 (https://phabricator.wikimedia.org/T208579) (owner: Arturo Borrero Gonzalez)
[19:37:47] (CR) Elukey: [V: 2 C: 2] Update cdh submodule to latest SHA [puppet] - https://gerrit.wikimedia.org/r/473071 (owner: Elukey)
[19:41:00] Operations, Analytics, Analytics-Kanban, netops: Figure out networking details for new cloud-analytics-eqiad Hadoop/Presto cluster - https://phabricator.wikimedia.org/T207321 (Ottomata) Let's make this happen! @Andrew are you ok with cloudvirt-anXXXX? @Cmjohnson would prefer to coordinate racki...
[19:52:39] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[19:56:00] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[20:04:41] (CR) GTirloni: [C: 2] toolforge: refactor mail server [puppet] - https://gerrit.wikimedia.org/r/471730 (https://phabricator.wikimedia.org/T208579) (owner: Arturo Borrero Gonzalez)
[20:04:53] (PS4) GTirloni: toolforge: refactor mail server [puppet] - https://gerrit.wikimedia.org/r/471730 (https://phabricator.wikimedia.org/T208579) (owner: Arturo Borrero Gonzalez)
[20:04:59] (CR) Framawiki: Disable FlaggedRevs on srwikinews, add autopatrol, patrol and rollbacker rights and enable RC patrol (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/472745 (https://phabricator.wikimedia.org/T209251) (owner: Zoranzoki21)
[20:05:45] PROBLEM - Memory correctable errors -EDAC- on thumbor1004 is CRITICAL: 4.001 ge 4 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=thumbor1004&var-datasource=eqiad%2520prometheus%252Fops
[20:07:15] PROBLEM - Host lvs2006 is DOWN: PING CRITICAL - Packet loss = 100%
[20:13:06] (CR) Framawiki: [C: -1] Disable FlaggedRevs on srwikinews, add autopatrol, patrol and rollbacker rights and enable RC patrol [mediawiki-config] - https://gerrit.wikimedia.org/r/472745 (https://phabricator.wikimedia.org/T209251) (owner: Zoranzoki21)
[20:16:09] (CR) Platonides: "Note: Should not be merged until after 18th November at 02:30 UTC+1 and confirming that community consensus was reached." [mediawiki-config] - https://gerrit.wikimedia.org/r/472745 (https://phabricator.wikimedia.org/T209251) (owner: Zoranzoki21)
[20:19:08] Operations, Analytics, Wikimedia-Logstash, Core Platform Team Backlog (Watching / External), and 2 others: Review and make librdkafka-0.11.6 installable from stretch-wikimedia - https://phabricator.wikimedia.org/T209300 (herron) >>! In T209300#4740170, @Ottomata wrote: > The only one there that w...
[20:19:53] Operations, Analytics, Wikimedia-Logstash, Core Platform Team Backlog (Watching / External), and 2 others: Review and make librdkafka-0.11.6 installable from stretch-wikimedia - https://phabricator.wikimedia.org/T209300 (Ottomata) Ok +1
[20:33:57] (PS1) Niedzielski: Prod: Enable Schema.org page split test at 1% sampling [mediawiki-config] - https://gerrit.wikimedia.org/r/473079 (https://phabricator.wikimedia.org/T208755)
[20:35:06] (CR) Niedzielski: [C: -1] "@phuedx, @addshore, this is the proposed configuration to be deployed *this Wednesday European morning*. Please do not merge before then." [mediawiki-config] - https://gerrit.wikimedia.org/r/473079 (https://phabricator.wikimedia.org/T208755) (owner: Niedzielski)
[20:53:24] Operations, Wikimedia-Logstash, Patch-For-Review, User-fgiunchedi: Setup rsyslog to be able to produce logs to Kafka - https://phabricator.wikimedia.org/T206633 (herron)
[21:00:04] cscott, arlolra, subbu, bearND, halfak, and Amir1: How many deployers does it take to do Services – Parsoid / Citoid / Mobileapps / ORES / … deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181112T2100).
[22:00:04] bawolff and Reedy: I, the Bot under the Fountain, allow thee, The Deployer, to do Weekly Security deployment window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181112T2200).
[22:04:31] (PS2) Herron: logstash: add generic kafka input config [puppet] - https://gerrit.wikimedia.org/r/470454 (https://phabricator.wikimedia.org/T206454)
[22:04:56] (CR) jerkins-bot: [V: -1] logstash: add generic kafka input config [puppet] - https://gerrit.wikimedia.org/r/470454 (https://phabricator.wikimedia.org/T206454) (owner: Herron)
[22:27:04] PROBLEM - Check systemd state on ms-be1035 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[22:28:22] (PS1) Herron: WIP: logstash::input::kafka add support for SSL/TLS options [puppet] - https://gerrit.wikimedia.org/r/473137 (https://phabricator.wikimedia.org/T206454)
[22:28:24] (PS1) Herron: WIP: logstash::input::kafka: add topics_prefix support [puppet] - https://gerrit.wikimedia.org/r/473138 (https://phabricator.wikimedia.org/T206454)
[22:49:01] Operations, Analytics, Analytics-Kanban, netops: Figure out networking details for new cloud-analytics-eqiad Hadoop/Presto cluster - https://phabricator.wikimedia.org/T207321 (Andrew) I'd prefer without the dash -- just cloudvirtan1XXX if cloudvirtanalytics1xxx won't fit.
[22:53:10] Operations, Analytics, Analytics-Kanban, netops: Figure out networking details for new cloud-analytics-eqiad Hadoop/Presto cluster - https://phabricator.wikimedia.org/T207321 (Ottomata) Ok, @Cmjohnson your call then: we'd prefer cloudvirtanalytics1xxx, but if that is too long, then use cloudvirta...
[23:01:00] How should we now proceed? :D
[23:13:14] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 239, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[23:13:25] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 124, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[23:37:35] PROBLEM - Long running screen/tmux on an-coord1001 is CRITICAL: CRIT: Long running SCREEN process. (user: jmm PID: 64941, 2459858s 1728000s).
[23:40:45] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 126, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[23:41:44] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 241, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down