[00:00:04] PROBLEM - puppet last run on labservices1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[00:19:34] PROBLEM - puppet last run on labtestvirt2002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[00:20:53] RoanKattouw, https://gerrit.wikimedia.org/r/368331
[00:20:54] PROBLEM - dhclient process on thumbor1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:21:04] PROBLEM - salt-minion processes on thumbor1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:21:34] PROBLEM - nutcracker process on thumbor1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:22:45] RECOVERY - dhclient process on thumbor1003 is OK: PROCS OK: 0 processes with command name dhclient
[00:22:54] RECOVERY - salt-minion processes on thumbor1003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[00:23:34] RECOVERY - nutcracker process on thumbor1003 is OK: PROCS OK: 1 process with UID = 111 (nutcracker), command name nutcracker
[00:27:12] !log bromine sudo -E reprepro clearvanished to delete unused precise-mediawiki causing reprepro errors
[00:27:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:31:14] PROBLEM - puppet last run on silver is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[00:35:07] Puppet, Cloud-VPS: role::puppetmaster::standalone: Unable to locate package geoipupdate - https://phabricator.wikimedia.org/T171916#3480251 (MaxSem)
[00:35:35] Puppet, Cloud-VPS: role::puppetmaster::standalone: Unable to locate package geoipupdate - https://phabricator.wikimedia.org/T171916#3479983 (MaxSem) Still fails, with even more errors (I tried on a fresh VM).
[00:48:54] PROBLEM - Check whether ferm is active by checking the default input chain on bromine is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly
[00:51:22] ^ me
[00:51:38] !log releases1001 - rsynced reprepro db data from bromine
[00:51:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:51:54] RECOVERY - Check whether ferm is active by checking the default input chain on bromine is OK: OK ferm input default policy is set
[00:55:25] (PS1) Rush: Revert "openstack: move openstack::repo to new model" [puppet] - https://gerrit.wikimedia.org/r/368332
[00:55:56] (CR) Rush: [V: 2 C: 2] Revert "openstack: move openstack::repo to new model" [puppet] - https://gerrit.wikimedia.org/r/368332 (owner: Rush)
[00:56:22] Operations, Patch-For-Review, Release-Engineering-Team (Kanban), Security-General: setup releases1001.eqiad.wmnet (was: setup mwreleases1001) - https://phabricator.wikimedia.org/T164030#3480257 (Dzahn) 17:27 < mutante> !log bromine sudo -E reprepro clearvanished to deleted unused precise-mediawik...
[00:57:16] Puppet, Cloud-VPS: role::puppetmaster::standalone: Unable to locate package geoipupdate - https://phabricator.wikimedia.org/T171916#3480258 (bd808) >>! In T171916#3480251, @MaxSem wrote: > Still fails, with even more errors (I tried on a fresh VM). Is this a jessie|stretch base image? For //"reasons"//...
[00:58:48] Puppet, Cloud-VPS: role::puppetmaster::standalone: Unable to locate package geoipupdate - https://phabricator.wikimedia.org/T171916#3480261 (MaxSem) Stretch.
[00:59:15] RECOVERY - puppet last run on labservices1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[01:00:24] RECOVERY - puppet last run on silver is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures
[01:10:23] (PS1) Dzahn: releases: rsync reprepro data, set active server in Hiera [puppet] - https://gerrit.wikimedia.org/r/368333 (https://phabricator.wikimedia.org/T164030)
[01:11:26] (CR) jerkins-bot: [V: -1] releases: rsync reprepro data, set active server in Hiera [puppet] - https://gerrit.wikimedia.org/r/368333 (https://phabricator.wikimedia.org/T164030) (owner: Dzahn)
[01:11:28] (PS2) Dzahn: releases: rsync reprepro data, set active server in Hiera [puppet] - https://gerrit.wikimedia.org/r/368333 (https://phabricator.wikimedia.org/T164030)
[01:12:28] (CR) jerkins-bot: [V: -1] releases: rsync reprepro data, set active server in Hiera [puppet] - https://gerrit.wikimedia.org/r/368333 (https://phabricator.wikimedia.org/T164030) (owner: Dzahn)
[01:12:58] (PS3) Dzahn: releases: rsync reprepro data, set active server in Hiera [puppet] - https://gerrit.wikimedia.org/r/368333 (https://phabricator.wikimedia.org/T164030)
[01:13:19] Puppet, Cloud-VPS: role::puppetmaster::standalone on stretch: Unable to locate package geoipupdate - https://phabricator.wikimedia.org/T171916#3480267 (bd808)
[01:16:15] RECOVERY - puppet last run on labtestcontrol2001 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures
[01:18:20] (PS2) Ayounsi: Assign internal IPs to pfw3-codfw<->pfw3-eqiad ipsec link [dns] - https://gerrit.wikimedia.org/r/367933 (https://phabricator.wikimedia.org/T169643)
[01:18:55] RECOVERY - puppet last run on labtestvirt2002 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures
[01:19:30] (PS1) Dzahn: install_server: add install2001 to DHCP, partman [puppet] - https://gerrit.wikimedia.org/r/368334 (https://phabricator.wikimedia.org/T171917)
[01:20:57] (PS2) Dzahn: install_server: add install2001 to DHCP, partman [puppet] - https://gerrit.wikimedia.org/r/368334 (https://phabricator.wikimedia.org/T171917)
[01:21:27] (CR) Ayounsi: [C: 2] Assign internal IPs to pfw3-codfw<->pfw3-eqiad ipsec link [dns] - https://gerrit.wikimedia.org/r/367933 (https://phabricator.wikimedia.org/T169643) (owner: Ayounsi)
[01:23:36] (CR) Dzahn: [C: 2] install_server: add install2001 to DHCP, partman [puppet] - https://gerrit.wikimedia.org/r/368334 (https://phabricator.wikimedia.org/T171917) (owner: Dzahn)
[01:32:57] (PS2) Dzahn: Revert "Set debug_level on icinga" [puppet] - https://gerrit.wikimedia.org/r/366876 (owner: Jcrespo)
[01:34:35] Operations, vm-requests, Patch-For-Review, Release-Engineering-Team (Kanban): setup releases2001.codfw.wmnet - https://phabricator.wikimedia.org/T171917#3480300 (Dzahn) ``` [!!] Install the GRUB boot loader on a hard disk ├┐ │...
[01:36:53] (CR) Dzahn: [C: 2] Revert "Set debug_level on icinga" [puppet] - https://gerrit.wikimedia.org/r/366876 (owner: Jcrespo)
[02:02:58] Operations, Patch-For-Review: create endowment.wm.org microsite - https://phabricator.wikimedia.org/T136735#3480311 (Dzahn) is this site still being planned? it's over a year later
[02:08:15] !log stat1002: disabled puppet, umounted /tmp, /home and /a, poweroff
[02:08:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:14:45] (CR) Dzahn: "Current Status: CRITICAL" [puppet] - https://gerrit.wikimedia.org/r/365640 (https://phabricator.wikimedia.org/T152712) (owner: Ottomata)
[02:18:00] Operations, Analytics, Analytics-Cluster: thorium - failed git clone of geowiki-data-private - https://phabricator.wikimedia.org/T171923#3480324 (Dzahn)
[02:18:09] ACKNOWLEDGEMENT - puppet last run on thorium is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_geowiki-data-private] daniel_zahn https://phabricator.wikimedia.org/T171923
[02:19:03] ACKNOWLEDGEMENT - Check systemd state on mw1260 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. daniel_zahn scheduled host downtime
[02:19:03] ACKNOWLEDGEMENT - puppet last run on mw1260 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 4 minutes ago with 2 failures. Failed resources (up to 3 shown): Service[jobchron],Service[jobrunner] daniel_zahn scheduled host downtime
[02:21:41] Operations, cloud-services-team: notebook100[12] - Invalid relationship: Apt::Pin[r-base] - https://phabricator.wikimedia.org/T171924#3480338 (Dzahn)
[02:22:01] ACKNOWLEDGEMENT - puppet last run on notebook1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues daniel_zahn https://phabricator.wikimedia.org/T171924
[02:22:14] ACKNOWLEDGEMENT - puppet last run on notebook1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues daniel_zahn https://phabricator.wikimedia.org/T171924
[02:24:49] Operations, Electron-PDFs, Patch-For-Review, Reading-Web-Backlog (Tracking), Services (blocked): pdfrender fails to serve requests since Mar 8 00:30:32 UTC on scb1003 - https://phabricator.wikimedia.org/T159922#3083419 (Dzahn) on scb100**2** ``` Current Status: CRITICAL (for 0d 5h 51m...
[02:26:48] RECOVERY - pdfrender on scb1002 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.002 second response time
[02:26:56] !log scb1002 - systemctl restart pdfrender - was "connect to address 10.64.16.21 and port 5252: Connection refused" in Icinga since a couple hours (T159922) - recovered
[02:27:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:27:06] T159922: pdfrender fails to serve requests since Mar 8 00:30:32 UTC on scb1003 - https://phabricator.wikimedia.org/T159922
[02:27:33] Operations, Electron-PDFs, Patch-For-Review, Reading-Web-Backlog (Tracking), Services (blocked): pdfrender fails to serve requests since Mar 8 00:30:32 UTC on scb1003 - https://phabricator.wikimedia.org/T159922#3480357 (Dzahn) >>! In T159922#3480356, @Stashbot wrote: > {nav icon=file, name=Me...
[02:34:54] (PS2) Andrew Bogott: m5-master: allow labspuppet@labpuppetmaster1001 and 1002 to labspuppet [puppet] - https://gerrit.wikimedia.org/r/368251
[02:47:33] (PS1) Andrew Bogott: labs puppetmaster: rebase from gerrit once per minute [puppet] - https://gerrit.wikimedia.org/r/368339
[02:47:35] (CR) Andrew Bogott: [C: 2] m5-master: allow labspuppet@labpuppetmaster1001 and 1002 to labspuppet [puppet] - https://gerrit.wikimedia.org/r/368251 (owner: Andrew Bogott)
[02:48:12] (PS2) Andrew Bogott: labs puppetmaster: rebase from gerrit once per minute [puppet] - https://gerrit.wikimedia.org/r/368339
[02:50:05] (CR) Andrew Bogott: [C: 2] labs puppetmaster: rebase from gerrit once per minute [puppet] - https://gerrit.wikimedia.org/r/368339 (owner: Andrew Bogott)
[03:01:51] (PS1) Andrew Bogott: puppetmaster profiles: add prevent_cherrypicks param [puppet] - https://gerrit.wikimedia.org/r/368340
[03:02:53] (CR) jerkins-bot: [V: -1] puppetmaster profiles: add prevent_cherrypicks param [puppet] - https://gerrit.wikimedia.org/r/368340 (owner: Andrew Bogott)
[03:04:38] (PS2) Andrew Bogott: puppetmaster profiles: add prevent_cherrypicks param [puppet] - https://gerrit.wikimedia.org/r/368340
[03:06:52] (PS3) Andrew Bogott: puppetmaster profiles: add prevent_cherrypicks param [puppet] - https://gerrit.wikimedia.org/r/368340
[03:10:18] (CR) Andrew Bogott: [C: 2] puppetmaster profiles: add prevent_cherrypicks param [puppet] - https://gerrit.wikimedia.org/r/368340 (owner: Andrew Bogott)
[03:26:29] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 721.24 seconds
[03:30:45] (PS1) Andrew Bogott: labs puppetmaster backend: open firewall on 8141 [puppet] - https://gerrit.wikimedia.org/r/368342
[03:32:45] (PS2) Andrew Bogott: labs puppetmaster backend: open firewall on 8141 [puppet] - https://gerrit.wikimedia.org/r/368342
[03:35:26] (CR) Andrew Bogott: [C: 2] labs puppetmaster backend: open firewall on 8141 [puppet] - https://gerrit.wikimedia.org/r/368342 (owner: Andrew Bogott)
[03:43:38] (PS1) Andrew Bogott: labs puppetmasters: Let the puppetmasters talk to each other on 8141 [puppet] - https://gerrit.wikimedia.org/r/368343
[03:44:39] (CR) jerkins-bot: [V: -1] labs puppetmasters: Let the puppetmasters talk to each other on 8141 [puppet] - https://gerrit.wikimedia.org/r/368343 (owner: Andrew Bogott)
[03:47:36] (PS2) Andrew Bogott: labs puppetmasters: Let the puppetmasters talk to each other on 8141 [puppet] - https://gerrit.wikimedia.org/r/368343
[03:48:51] (CR) Andrew Bogott: [C: 2] labs puppetmasters: Let the puppetmasters talk to each other on 8141 [puppet] - https://gerrit.wikimedia.org/r/368343 (owner: Andrew Bogott)
[04:05:39] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 210.35 seconds
[04:09:28] (PS1) Andrew Bogott: define puppetmaster::servers for labpuppetmaster1002 [puppet] - https://gerrit.wikimedia.org/r/368345
[04:11:03] (CR) Andrew Bogott: [C: 2] define puppetmaster::servers for labpuppetmaster1002 [puppet] - https://gerrit.wikimedia.org/r/368345 (owner: Andrew Bogott)
[04:25:35] (PS1) Andrew Bogott: labs puppetmaster: simplify allow_from rules [puppet] - https://gerrit.wikimedia.org/r/368347
[04:26:38] (CR) jerkins-bot: [V: -1] labs puppetmaster: simplify allow_from rules [puppet] - https://gerrit.wikimedia.org/r/368347 (owner: Andrew Bogott)
[04:28:20] (PS2) Andrew Bogott: labs puppetmaster: simplify allow_from rules [puppet] - https://gerrit.wikimedia.org/r/368347
[04:29:44] (CR) Andrew Bogott: [C: 2] labs puppetmaster: simplify allow_from rules [puppet] - https://gerrit.wikimedia.org/r/368347 (owner: Andrew Bogott)
[04:42:58] PROBLEM - HP RAID on ms-be1017 is CRITICAL: CRITICAL: Slot 1: OK: 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2 - Controller: OK - Cache: Permanently Disabled - Cable Error - Battery/Capacitor: Recharging
[04:43:02] ACKNOWLEDGEMENT - HP RAID on ms-be1017 is CRITICAL: CRITICAL: Slot 1: OK: 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2 - Controller: OK - Cache: Permanently Disabled - Cable Error - Battery/Capacitor: Recharging nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T171926
[04:43:07] Operations, ops-eqiad: Degraded RAID on ms-be1017 - https://phabricator.wikimedia.org/T171926#3480411 (ops-monitoring-bot)
[05:19:43] (Abandoned) Krinkle: mediawiki: update apache rules for 2.4 [puppet] - https://gerrit.wikimedia.org/r/225552 (owner: Gergő Tisza)
[05:21:29] PROBLEM - puppet last run on db1063 is CRITICAL: CRITICAL: Puppet has 4 failures. Last run 2 minutes ago with 4 failures. Failed resources (up to 3 shown): Package[tzdata],Exec[wikidev_ensure_members],Exec[ops_ensure_members],Exec[absent_ensure_members]
[05:48:58] RECOVERY - puppet last run on db1063 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures
[06:09:39] RECOVERY - WDQS SPARQL on wdqs1001 is OK: HTTP OK: HTTP/1.1 200 OK - 13348 bytes in 0.001 second response time
[06:09:59] RECOVERY - WDQS HTTP on wdqs1001 is OK: HTTP OK: HTTP/1.1 200 OK - 13348 bytes in 0.001 second response time
[06:12:09] wikidata changes stopped about 20 mins ago - does anybody know the reason?
[06:27:29] PROBLEM - High lag on wdqs2002 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1800.0]
[06:27:39] PROBLEM - High lag on wdqs2001 is CRITICAL: CRITICAL: 31.03% of data above the critical threshold [1800.0]
[06:28:09] PROBLEM - High lag on wdqs1003 is CRITICAL: CRITICAL: 34.48% of data above the critical threshold [1800.0]
[06:28:18] PROBLEM - High lag on wdqs2003 is CRITICAL: CRITICAL: 31.03% of data above the critical threshold [1800.0]
[06:30:18] PROBLEM - High lag on wdqs1002 is CRITICAL: CRITICAL: 31.03% of data above the critical threshold [1800.0]
[06:30:19] PROBLEM - High lag on wdqs2003 is CRITICAL: CRITICAL: 31.03% of data above the critical threshold [1800.0]
[06:33:39] PROBLEM - High lag on wdqs2001 is CRITICAL: CRITICAL: 44.83% of data above the critical threshold [1800.0]
[06:35:20] Operations, Wikidata: Wikidata database locked - https://phabricator.wikimedia.org/T171928#3480493 (Esc3300)
[06:35:54] !log installing apache security updates on trusty systems
[06:36:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:41:58] PROBLEM - puppet last run on mw1259 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[apache2]
[06:45:36] Operations, Android-app-feature-Compilations, Traffic, Wikipedia-Android-App-Backlog, Reading-Infrastructure-Team-Backlog (Kanban): Determine where to host zim files for the Android app - https://phabricator.wikimedia.org/T170843#3480500 (Nemo_bis)
[06:51:46] "read-only wiki" while adding wikidata link. Known outage?
[06:52:08] RECOVERY - puppet last run on mw1259 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures
[06:56:53] Operations, Puppet, Traffic, Mobile, and 2 others: URLs with title query string parameter and additional query string parameters do not redirect to mobile site - https://phabricator.wikimedia.org/T154227#2904582 (Nemo_bis) Can you guarantee to support all the URLs with parameters which would get...
[06:57:24] kart_: the only thing that we know afaics is the alerts for Wikidata Query Service lag
[06:57:51] ah ok unbreak now - https://phabricator.wikimedia.org/T171928
[06:58:28] Operations, Wikidata: Wikidata database locked - https://phabricator.wikimedia.org/T171928#3480510 (Esc3300) Wikivoyage seems to work.
[06:59:11] Operations, Wikidata: Wikidata database locked - https://phabricator.wikimedia.org/T171928#3480468 (Mbch331) Dutch Wikipedia also works. So it's not all projects.
[07:04:43] Operations, Wikidata: Wikidata database locked - https://phabricator.wikimedia.org/T171928#3480468 (jcrespo) Database crashed, it should be ok to edit now.
[07:04:46] Operations, ops-eqiad: Degraded RAID on ms-be1017 - https://phabricator.wikimedia.org/T171926#3480518 (MoritzMuehlenhoff) p:Triage>Normal a:Cmjohnson
[07:05:26] Operations, Wikidata: Wikidata database locked - https://phabricator.wikimedia.org/T171928#3480468 (Joe) I just did two test edits, I can confirm it works.
[07:12:14] Operations, Wikidata: Wikidata database locked - https://phabricator.wikimedia.org/T171928#3480528 (Esc3300) Yes, it's back! Thanks for your help. ``` (diff | hist) . . 99minutos.com (Q33542455); 07:03 . . (-95) . . Tarawa1943 (talk | contribs) (Page on [eswiki] deleted: 99minutos.com) [rollback] (...
[07:12:22] kart_: ---^
[07:12:34] thanks for the ping
[07:12:58] Operations, Wikidata: Wikidata database locked - https://phabricator.wikimedia.org/T171928#3480529 (Esc3300) p:Unbreak!>Triage
[07:19:32] elukey: thanks. I was about to report it :)
[07:26:28] RECOVERY - High lag on wdqs1003 is OK: OK: Less than 30.00% above the threshold [600.0]
[07:26:38] RECOVERY - High lag on wdqs2003 is OK: OK: Less than 30.00% above the threshold [600.0]
[07:26:58] RECOVERY - High lag on wdqs2001 is OK: OK: Less than 30.00% above the threshold [600.0]
[07:27:29] RECOVERY - High lag on wdqs1002 is OK: OK: Less than 30.00% above the threshold [600.0]
[07:27:48] RECOVERY - High lag on wdqs2002 is OK: OK: Less than 30.00% above the threshold [600.0]
[07:48:58] Operations, cloud-services-team: notebook100[12] - Invalid relationship: Apt::Pin[r-base] - https://phabricator.wikimedia.org/T171924#3480579 (MoritzMuehlenhoff) p:Triage>High Seems like a side effect of 7dfe90c0d494999e2cfc05b12169401d40d54c99 ?
[07:49:25] Operations, ORES, Scoring-platform-team-Backlog: Reimage ores* hosts with Debian Stretch - https://phabricator.wikimedia.org/T171851#3480582 (MoritzMuehlenhoff) p:Triage>Normal
[07:52:27] !log gehel@puppetmaster1001 conftool action : set/pooled=yes; selector: name=wdqs1001.eqiad.wmnet
[07:52:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:52:38] !log forced mii-tool -r eth0 on analytics1034 to get 1G negotiated speed
[07:52:40] !log repooling wdqs1001 (data import completed)
[07:52:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:52:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:56:43] !log update nodejs to 6.11 on aqs1004 (testing prod node after beta qa) - T170790
[07:56:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:56:53] T170790: Upgrade AQS to node 6.11 - https://phabricator.wikimedia.org/T170790
[08:01:53] Operations, Wikimedia-Language-setup, Wikimedia-Site-requests, Hindi-Sites, and 2 others: Create Wikiversity Hindi - https://phabricator.wikimedia.org/T168765#3376126 (Crochet.david) >>! In T168765#3477698, @Jayprakash12345 wrote: > > https://commons.wikimedia.org/wiki/File:Wikiversity-logo-hi.s...
[08:04:58] RECOVERY - puppet last run on thorium is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures
[08:05:28] Operations, Wikimedia-Language-setup, Wikimedia-Site-requests, Hindi-Sites, and 2 others: Create Wikiversity Hindi - https://phabricator.wikimedia.org/T168765#3480635 (Crochet.david) >>! In T168765#3468785, @Jayprakash12345 wrote: > Will Quiz Extension be install automatically at the time of wiki...
[08:09:06] Operations, Commons, Traffic, media-storage: 503 error for certain JPG thumbnail: "Backend fetch failed" - https://phabricator.wikimedia.org/T171421#3480638 (ema) Open>Resolved a:ema We do have occasional backend fetch failures. Closing, as this looks like a transient error.
[08:11:40] Operations, Analytics, Analytics-Cluster: thorium - failed git clone of geowiki-data-private - https://phabricator.wikimedia.org/T171923#3480643 (elukey) This issue has already happened in the past, this brutal sequence of commands fixed it: ``` root@thorium:/srv/geowiki# rm -rf data-private root@th...
[08:12:06] Operations, Analytics, Analytics-Cluster, User-Elukey: thorium - failed git clone of geowiki-data-private - https://phabricator.wikimedia.org/T171923#3480644 (elukey) p:Triage>Normal
[08:33:45] (PS4) Filippo Giunchedi: Don't show diffs for files with secret content [puppet] - https://gerrit.wikimedia.org/r/366806 (https://phabricator.wikimedia.org/T79881)
[08:33:56] Operations, Wikimedia-Language-setup, Wikimedia-Site-requests, Hindi-Sites, and 2 others: Create Wikiversity Hindi - https://phabricator.wikimedia.org/T168765#3480703 (Jayprakash12345) >>! In T168765#3480603, @Crochet.david wrote: >>>! In T168765#3477698, @Jayprakash12345 wrote: >> >> https://co...
[08:34:13] (PS6) Giuseppe Lavagetto: Add filters to the future parser [software/puppet-compiler] - https://gerrit.wikimedia.org/r/367678
[08:35:02] (CR) jerkins-bot: [V: -1] Add filters to the future parser [software/puppet-compiler] - https://gerrit.wikimedia.org/r/367678 (owner: Giuseppe Lavagetto)
[08:36:37] (CR) Filippo Giunchedi: [C: 2] Don't show diffs for files with secret content [puppet] - https://gerrit.wikimedia.org/r/366806 (https://phabricator.wikimedia.org/T79881) (owner: Filippo Giunchedi)
[08:38:46] ugh, incoming puppet shower, sorry about that
[08:39:08] shall we stop ircecho
[08:39:10] ?
[08:39:28] PROBLEM - puppet last run on xenon is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:39:38] PROBLEM - puppet last run on maps1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:39:38] PROBLEM - puppet last run on wtp1025 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:39:46] elukey: fixing as we speak, but yeah if you could stop it!
[08:39:48] PROBLEM - puppet last run on elastic1021 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:39:48] PROBLEM - puppet last run on puppetmaster1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:39:48] PROBLEM - puppet last run on restbase-test2003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:39:49] PROBLEM - puppet last run on mw2226 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:39:49] PROBLEM - puppet last run on ms-be2021 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:39:58] PROBLEM - puppet last run on dbstore1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:39:58] PROBLEM - puppet last run on mw2171 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:39:58] PROBLEM - puppet last run on ms-be1028 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:39:59] PROBLEM - puppet last run on wasat is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:40:07] (PS1) Filippo Giunchedi: profile: fix bogus show_diff for ssh::userkey [puppet] - https://gerrit.wikimedia.org/r/368361
[08:40:08] PROBLEM - puppet last run on analytics1065 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:40:08] PROBLEM - puppet last run on cp1074 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:40:08] PROBLEM - puppet last run on mw2131 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:40:10] PROBLEM - puppet last run on wdqs1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:40:10] PROBLEM - puppet last run on db1023 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:40:10] PROBLEM - puppet last run on db1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:40:18] PROBLEM - puppet last run on db2073 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:40:18] PROBLEM - puppet last run on ms-be1030 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:40:18] PROBLEM - puppet last run on mw1200 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:40:18] PROBLEM - puppet last run on kubestagetcd1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:40:18] PROBLEM - puppet last run on labsdb1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:40:19] PROBLEM - puppet last run on kafka1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:40:26] * elukey stops
[08:40:28] PROBLEM - puppet last run on radon is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:40:28] PROBLEM - puppet last run on mw1299 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:40:28] PROBLEM - puppet last run on logstash1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:40:28] PROBLEM - puppet last run on lvs2005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:40:34] (CR) Filippo Giunchedi: [V: 2 C: 2] profile: fix bogus show_diff for ssh::userkey [puppet] - https://gerrit.wikimedia.org/r/368361 (owner: Filippo Giunchedi)
[08:40:38] PROBLEM - puppet last run on rhenium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:40:38] PROBLEM - puppet last run on rdb1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:40:38] PROBLEM - puppet last run on mw1282 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:40:38] PROBLEM - puppet last run on es2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:40:38] PROBLEM - puppet last run on chlorine is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:40:38] PROBLEM - puppet last run on wtp1038 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:40:38] PROBLEM - puppet last run on elastic1051 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:40:39] PROBLEM - puppet last run on mw2257 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:40:39] PROBLEM - puppet last run on cp1099 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:40:40] PROBLEM - puppet last run on cp1050 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:40:40] PROBLEM - puppet last run on mw1244 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:40:48] PROBLEM - puppet last run on mw1206 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:40:48] PROBLEM - puppet last run on analytics1062 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:40:48] PROBLEM - puppet last run on ms-fe3002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:40:48] PROBLEM - puppet last run on mw1296 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:40:48] PROBLEM - puppet last run on db2089 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:40:48] PROBLEM - puppet last run on oresrdb1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:40:48] PROBLEM - puppet last run on mw2210 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:41:29] !log stop ircecho on einsteinium as puppet-error-shower countermeasure
[08:41:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:03:28] Operations, Wikidata: Wikidata database locked - https://phabricator.wikimedia.org/T171928#3480745 (jcrespo) a:jcrespo Investigation is not over, here is what we have found out so far about the causes: https://wikitech.wikimedia.org/wiki/Incident_documentation/20170728-s5_(WikiData_and_dewiki)_read-only
[09:07:29] !log installing apache security updates on puppet masters
[09:07:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:11:24] (PS7) Giuseppe Lavagetto: Add filters to the future parser [software/puppet-compiler] - https://gerrit.wikimedia.org/r/367678
[09:12:05] Operations, Wikidata: Wikidata and dewiki databases locked - https://phabricator.wikimedia.org/T171928#3480757 (jcrespo)
[09:14:30] Operations, Wikidata, Wikimedia-Incident: Wikidata and dewiki databases locked - https://phabricator.wikimedia.org/T171928#3480761 (Peachey88)
[09:25:19] (PS8) Giuseppe Lavagetto: Add filters to the future parser [software/puppet-compiler] - https://gerrit.wikimedia.org/r/367678
[09:26:10] (CR) jerkins-bot: [V: -1] Add filters to the future parser [software/puppet-compiler] - https://gerrit.wikimedia.org/r/367678 (owner: Giuseppe Lavagetto)
[09:28:45] Operations, monitoring, User-fgiunchedi: Update diamond to latest upstream version - https://phabricator.wikimedia.org/T97635#3480780 (fgiunchedi) Open>stalled The `--log-stdout` issue has been filed as https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=869970 As for the slow shutdown I've re...
[09:32:01] (03PS9) 10Giuseppe Lavagetto: Add filters to the future parser [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/367678 [09:36:50] 10Operations, 10ops-eqiad: Degraded RAID on ms-be1017 - https://phabricator.wikimedia.org/T171926#3480786 (10fgiunchedi) p:05Normal>03High @Cmjohnson I suspect this is again the battery dying and needs replacement, same as {T171183} [09:37:05] 10Operations, 10ops-eqiad, 10User-fgiunchedi: Degraded RAID on ms-be1016 - https://phabricator.wikimedia.org/T171183#3456811 (10fgiunchedi) p:05Normal>03High [09:39:47] 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-Cluster, 10Patch-For-Review: rack/setup/install new kafka nodes kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T167992#3480794 (10elukey) Since the kafka1012->kafka1022 are going to be decommed and kafka-jumbo is a complete new cluster from our... [09:41:41] !log re-enable irc-echo on einsteinium [09:41:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:42:52] (03PS10) 10Giuseppe Lavagetto: Add filters to the future parser [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/367678 [09:42:58] RECOVERY - puppet last run on einsteinium is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [09:43:48] RECOVERY - puppet last run on mw2150 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [09:52:32] 10Operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, 10Hindi-Sites, and 2 others: Create Wikiversity Hindi - https://phabricator.wikimedia.org/T168765#3480813 (10gh87) @Jayprakash12345 You can ask at [[https://commons.wikimedia.org/wiki/Commons:Graphic_Lab/Illustration_workshop|Commons:Grap... 
[09:53:25] 10Operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, 10Hindi-Sites, and 2 others: Create Wikiversity Hindi - https://phabricator.wikimedia.org/T168765#3480814 (10Urbanecm) [10:15:26] (03PS1) 10ArielGlenn: do rsyncs of pageviews and other items from stat1005 now instead of stat1002 [puppet] - 10https://gerrit.wikimedia.org/r/368383 [10:15:38] thanksss apergos ! [10:15:42] I was about to do it :) [10:15:53] I just saw the task [10:16:31] can we put the value in hiera though? [10:17:34] we can, but that should be part of setting up the new labstore hosts, which will be taking over the dataset roles. [10:17:39] or rather, some of the dataset roles. [10:18:01] okok :) [10:19:24] (03CR) 10ArielGlenn: [C: 032] do rsyncs of pageviews and other items from stat1005 now instead of stat1002 [puppet] - 10https://gerrit.wikimedia.org/r/368383 (owner: 10ArielGlenn) [10:20:02] (03PS5) 10Ema: pybal::monitoring: add check_pybal_ipvs_diff [puppet] - 10https://gerrit.wikimedia.org/r/367662 (https://phabricator.wikimedia.org/T134893) [10:20:26] (03CR) 10Ema: [V: 032 C: 032] pybal::monitoring: add check_pybal_ipvs_diff [puppet] - 10https://gerrit.wikimedia.org/r/367662 (https://phabricator.wikimedia.org/T134893) (owner: 10Ema) [10:29:54] (03PS11) 10Giuseppe Lavagetto: Add filters to the future parser [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/367678 [10:31:11] !log upgrading and restarting labsdb1009 and labsdb1011 [10:31:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:31:23] 10Operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, 10Hindi-Sites, and 2 others: Create Wikiversity Hindi - https://phabricator.wikimedia.org/T168765#3480849 (10Urbanecm) @Dereckson Can you reserve a window for this? 
[10:40:34] (03PS1) 10Jcrespo: labsdb-replicas: Update new labsdb hosts to stretch/systemd [puppet] - 10https://gerrit.wikimedia.org/r/368391 (https://phabricator.wikimedia.org/T153743) [10:42:37] (03PS2) 10Jcrespo: labsdb-replicas: Update new labsdb hosts to stretch/systemd [puppet] - 10https://gerrit.wikimedia.org/r/368391 (https://phabricator.wikimedia.org/T153743) [10:47:53] (03CR) 10Jcrespo: [C: 032] labsdb-replicas: Update new labsdb hosts to stretch/systemd [puppet] - 10https://gerrit.wikimedia.org/r/368391 (https://phabricator.wikimedia.org/T153743) (owner: 10Jcrespo) [10:58:15] (03PS1) 10Muehlenhoff: Add scons to package list [puppet] - 10https://gerrit.wikimedia.org/r/368399 [11:01:05] (03CR) 10Urbanecm: [C: 031] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/367892 (https://phabricator.wikimedia.org/T171501) (owner: 10MarcoAurelio) [11:08:28] PROBLEM - graphite.wikimedia.org on graphite1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 398 bytes in 0.001 second response time [11:09:29] The proxy server received an invalid response from an upstream server. 
[11:09:46] indeed, doesn't seem very happy, taking a look [11:10:28] RECOVERY - graphite.wikimedia.org on graphite1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1547 bytes in 1.737 second response time [11:11:28] mhh recovered by itself [11:11:32] yeah [11:13:24] from the apache logs it seems that uwsgi was off for a bit [11:16:36] yeah using a lot of cpu too [11:17:14] (03CR) 10Muehlenhoff: [C: 032] Add scons to package list [puppet] - 10https://gerrit.wikimedia.org/r/368399 (owner: 10Muehlenhoff) [11:17:19] (03PS2) 10Muehlenhoff: Add scons to package list [puppet] - 10https://gerrit.wikimedia.org/r/368399 [11:19:08] from ~10:58 [11:20:21] yeah I'm looking at the graphite-web logs to see if a query stands out [11:21:18] 294 2017-07-28T10:56 [11:21:18] 282 2017-07-28T10:57 [11:21:18] 4 2017-07-28T10:58 [11:21:18] 2 2017-07-28T10:57 [11:21:19] 3 2017-07-28T10:58 [11:21:26] reqs from the apache logs [11:43:03] (03CR) 10Daniel Kinzler: "@hoo that sounds like a good suggestion. Can you make a ticket or patch for doing this?" [puppet] - 10https://gerrit.wikimedia.org/r/366887 (https://phabricator.wikimedia.org/T171263) (owner: 10Ladsgroup) [11:46:41] 10Operations, 10Multimedia, 10TimedMediaHandler, 10HHVM, 10Patch-For-Review: Migrate video scalers to jessie - https://phabricator.wikimedia.org/T145742#3480961 (10MoritzMuehlenhoff) Status update: The new jessie scaler has been exposed to production traffic and a few files have been identified which cra... [11:47:32] ACKNOWLEDGEMENT - Check systemd state on mw1260 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. Muehlenhoff T145742 [11:47:32] ACKNOWLEDGEMENT - puppet last run on mw1260 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 2 minutes ago with 2 failures. 
Failed resources (up to 3 shown): Service[jobchron],Service[jobrunner] Muehlenhoff T145742 [11:58:36] (03CR) 10Ladsgroup: "I can amend this patch if you want to" [puppet] - 10https://gerrit.wikimedia.org/r/366887 (https://phabricator.wikimedia.org/T171263) (owner: 10Ladsgroup) [12:03:00] (03PS1) 10Jcrespo: labsdb: Rename sanitarium2 to sanitarium multisource [puppet] - 10https://gerrit.wikimedia.org/r/368408 (https://phabricator.wikimedia.org/T153743) [12:05:21] 10Operations, 10Multimedia, 10TimedMediaHandler, 10HHVM, 10Patch-For-Review: Migrate video scalers to jessie - https://phabricator.wikimedia.org/T145742#3481017 (10MoritzMuehlenhoff) The test case (https://commons.wikimedia.org/wiki/File:National_Archaeological_Museum_Kabile_-_near_Yambol.webm) also cras... [12:06:55] (03PS2) 10Jcrespo: labsdb: Rename sanitarium2 to sanitarium multisource [puppet] - 10https://gerrit.wikimedia.org/r/368408 (https://phabricator.wikimedia.org/T153743) [12:09:14] 10Operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, 10Hindi-Sites, and 2 others: Create Wikiversity Hindi - https://phabricator.wikimedia.org/T168765#3481022 (10Jayprakash12345) @Urbanecm Sir, Our native member made new logo Logo:-https://commons.wikimedia.org/wiki/File:Wikividhyalay_logo... 
[12:18:59] (03PS3) 10Jcrespo: labsdb: Rename sanitarium2 to sanitarium multisource [puppet] - 10https://gerrit.wikimedia.org/r/368408 (https://phabricator.wikimedia.org/T153743) [12:20:24] (03CR) 10Elukey: Kafka broker profile and roles for new 'aggregate' (TBD) cluster and 'simple' cluster (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/356232 (https://phabricator.wikimedia.org/T166162) (owner: 10Ottomata) [12:20:49] (03PS4) 10Jcrespo: labsdb: Rename sanitarium2 to sanitarium multisource [puppet] - 10https://gerrit.wikimedia.org/r/368408 (https://phabricator.wikimedia.org/T153743) [12:22:24] (03CR) 10Daniel Kinzler: "@Ladsgroup sure, agree on a good config with Hoo and roll it out :)" [puppet] - 10https://gerrit.wikimedia.org/r/366887 (https://phabricator.wikimedia.org/T171263) (owner: 10Ladsgroup) [12:25:44] (03PS5) 10Jcrespo: labsdb: Rename sanitarium2 to sanitarium multisource [puppet] - 10https://gerrit.wikimedia.org/r/368408 (https://phabricator.wikimedia.org/T153743) [12:27:54] (03PS3) 10Ladsgroup: mediawiki: increase the batch size of dispatchChanges cronjob [puppet] - 10https://gerrit.wikimedia.org/r/366887 (https://phabricator.wikimedia.org/T171263) [12:28:43] (03CR) 10Ladsgroup: "Done, @hoo: What do you think?" 
[puppet] - 10https://gerrit.wikimedia.org/r/366887 (https://phabricator.wikimedia.org/T171263) (owner: 10Ladsgroup) [12:32:02] Hi [12:32:13] I'm getting very poor performance out of English Wikisource [12:32:27] Sometimes it's taking over 2 mins to load pages [12:32:28] (03PS6) 10Jcrespo: labsdb: Rename sanitarium2 to sanitarium multisource [puppet] - 10https://gerrit.wikimedia.org/r/368408 (https://phabricator.wikimedia.org/T153743) [12:32:52] (03PS1) 10Muehlenhoff: Restore access for ladsgroup [puppet] - 10https://gerrit.wikimedia.org/r/368411 (https://phabricator.wikimedia.org/T170801) [12:35:18] (03CR) 10Muehlenhoff: [C: 032] Restore access for ladsgroup [puppet] - 10https://gerrit.wikimedia.org/r/368411 (https://phabricator.wikimedia.org/T170801) (owner: 10Muehlenhoff) [12:37:24] 10Operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, 10Hindi-Sites, and 2 others: Create Wikiversity Hindi - https://phabricator.wikimedia.org/T168765#3481082 (10Urbanecm) Okay, ack'ed. Will replace the logos. [12:39:22] (03PS7) 10Urbanecm: Initial configuration for hiwikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/368165 (https://phabricator.wikimedia.org/T168765) [12:40:12] (03PS8) 10Urbanecm: Initial configuration for hiwikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/368165 (https://phabricator.wikimedia.org/T168765) [12:40:55] 10Operations, 10Commons, 10Traffic, 10media-storage: 503 error for certain JPG thumbnail: "Backend fetch failed" - https://phabricator.wikimedia.org/T171421#3481084 (10Jeff_G) >>! In T171421#3469152, @fgiunchedi wrote: > @Aklapper _usually_ traffic since this indicates varnish failure to fetch and most lik... [12:41:53] (03PS9) 10Urbanecm: Initial configuration for hiwikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/368165 (https://phabricator.wikimedia.org/T168765) [12:43:30] ShakespeareFan00: are you logged in, which continent are you on? 
[12:43:41] Logged in - Europe [12:44:31] cannot reproduce, maybe ips issues? [12:44:54] Narrowing it down UK ( Vodafone) [12:44:54] *isp [12:45:28] can you ping wikipedia.org and see if you have packet loss? [12:47:22] ShakespeareFan00 works for me on bt. [12:47:36] (03PS3) 10Urbanecm: Initial configuration for wikimania2018wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/368168 (https://phabricator.wikimedia.org/T155038) [12:47:59] I am checking network vendors maintenance or alerts and network alerts and see nothing, but I will keep looking [12:48:34] (03PS4) 10Urbanecm: Initial configuration for wikimania2018wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/368168 (https://phabricator.wikimedia.org/T155038) [12:49:30] 10Operations, 10Multimedia, 10TimedMediaHandler, 10HHVM, 10Patch-For-Review: Migrate video scalers to jessie - https://phabricator.wikimedia.org/T145742#3481105 (10MoritzMuehlenhoff) Another observation: Using ffmpeg to convert to ogv, the conversion works just fine (tested on stretch, will also repeat o... [12:50:01] no performance issues on the metrics https://grafana.wikimedia.org/dashboard/db/navigation-timing-by-continent [12:50:36] but please provide more info if you have it of page load performance and network issues [12:53:33] (03CR) 10Hoo man: [C: 031] "This should solve (or at least ease) the enwiki dispatch backlog issue for now." [puppet] - 10https://gerrit.wikimedia.org/r/366887 (https://phabricator.wikimedia.org/T171263) (owner: 10Ladsgroup) [12:55:18] 10Operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, 10Hindi-Sites, and 2 others: Create Wikiversity Hindi - https://phabricator.wikimedia.org/T168765#3481107 (10Jayprakash12345) >>! In T168765#3481082, @Urbanecm wrote: > Okay, ack'ed. Will replace the logos. Yes sir, please change the lo... [12:55:41] (03CR) 10Jcrespo: "I am ok with this, but please let's deploy on Monday- technically, no deployments should happen on Fridays." 
[puppet] - 10https://gerrit.wikimedia.org/r/366887 (https://phabricator.wikimedia.org/T171263) (owner: 10Ladsgroup) [12:55:54] 10Operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, 10Hindi-Sites, and 2 others: Create Wikiversity Hindi - https://phabricator.wikimedia.org/T168765#3481109 (10Urbanecm) Just was done before a moment [12:56:20] 10Operations, 10Commons, 10Traffic, 10media-storage: 503 error for certain JPG thumbnail: "Backend fetch failed" - https://phabricator.wikimedia.org/T171421#3481110 (10ema) >>! In T171421#3481084, @Jeff_G wrote: >>>! In T171421#3469152, @fgiunchedi wrote: >> @Aklapper _usually_ traffic since this indicates... [12:59:22] (03PS1) 10Urbanecm: Revert "Revert "Set initial configuration for techconduct.wikimedia.org"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/368415 [12:59:31] (03CR) 10jerkins-bot: [V: 04-1] Revert "Revert "Set initial configuration for techconduct.wikimedia.org"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/368415 (owner: 10Urbanecm) [13:03:02] (03PS2) 10Urbanecm: Revert "Revert "Set initial configuration for techconduct.wikimedia.org"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/368415 (https://phabricator.wikimedia.org/T165977) [13:04:06] (03PS3) 10Urbanecm: Revert "Revert "Set initial configuration for techconduct.wikimedia.org"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/368415 (https://phabricator.wikimedia.org/T165977) [13:04:08] !log upgrading and restarting db1095 [13:04:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:04:27] (03CR) 10Jcrespo: [C: 032] labsdb: Rename sanitarium2 to sanitarium multisource [puppet] - 10https://gerrit.wikimedia.org/r/368408 (https://phabricator.wikimedia.org/T153743) (owner: 10Jcrespo) [13:04:35] (03PS7) 10Jcrespo: labsdb: Rename sanitarium2 to sanitarium multisource [puppet] - 10https://gerrit.wikimedia.org/r/368408 (https://phabricator.wikimedia.org/T153743) [13:06:36] (03PS1) 10Ema: 
pybal::monitoring: add OK message to check_pybal_ipvs_diff [puppet] - 10https://gerrit.wikimedia.org/r/368416 (https://phabricator.wikimedia.org/T134893) [13:07:52] (03CR) 10Gehel: [C: 031] "LGTM" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/367930 (https://phabricator.wikimedia.org/T170494) (owner: 10Bearloga) [13:25:37] (03CR) 10Daniel Kinzler: "Let's hope the backlog doesn't grow huge over the weekend, then..." [puppet] - 10https://gerrit.wikimedia.org/r/366887 (https://phabricator.wikimedia.org/T171263) (owner: 10Ladsgroup) [13:28:36] (03PS3) 10Reception123: Added wordmark for Wikipedia Atikamekw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/368198 [13:28:44] (03CR) 10jerkins-bot: [V: 04-1] Added wordmark for Wikipedia Atikamekw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/368198 (owner: 10Reception123) [13:29:54] (03CR) 10Reception123: "Conflicts. Rebase failed (still merge conflict). Not sure what I can do" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/368198 (owner: 10Reception123) [13:31:40] I'll assume it's localised then thanks [13:52:35] (03PS1) 10Urbanecm: Optimalize all PNGs in this repo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/368423 (https://phabricator.wikimedia.org/T170569) [13:54:24] (03PS1) 10Andrew Bogott: labs puppetmaster: allow all puppetmasters access to the enc api [puppet] - 10https://gerrit.wikimedia.org/r/368424 [14:00:58] (03CR) 10Andrew Bogott: [C: 032] labs puppetmaster: allow all puppetmasters access to the enc api [puppet] - 10https://gerrit.wikimedia.org/r/368424 (owner: 10Andrew Bogott) [14:01:51] (03PS12) 10Giuseppe Lavagetto: Add filters to the future parser [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/367678 [14:02:55] (03CR) 10jerkins-bot: [V: 04-1] Add filters to the future parser [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/367678 (owner: 10Giuseppe Lavagetto) [14:03:41] (03PS1) 10Reedy: Make babel use Database and SUL wikis use metawiki 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/368429 (https://phabricator.wikimedia.org/T145366) [14:05:15] * halfak looks for akosiaris [14:05:44] 10Operations, 10Multimedia, 10TimedMediaHandler, 10HHVM, 10Patch-For-Review: Migrate video scalers to jessie - https://phabricator.wikimedia.org/T145742#3481301 (10MoritzMuehlenhoff) Also tested to work fine with jessie-wikimedia: ffmpeg -i National_Archaeological_Museum_Kabile_-_near_Yambol.webm -codec:... [14:06:15] <_joe_> halfak: he's on PTO [14:06:42] <_joe_> whatever you needed akosiaris for, you'd have to ask someone else in ops :) [14:07:03] <_joe_> and since it's 4 PM on friday, I hope you come bearing gifts :) [14:08:09] (03PS13) 10Giuseppe Lavagetto: Add filters to the future parser [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/367678 [14:08:50] Hey _joe_! I've been trying to run a stress test with akosiaris for a while on the new ORES cluster. I was hoping to have some opsen work with me in sync to make sure we learned what we needed to. [14:09:03] (03PS1) 10Andrew Bogott: labs puppetmaster: allow access to the enc api via ipv6 [puppet] - 10https://gerrit.wikimedia.org/r/368430 [14:09:44] No gifts I'm afraid. The best I can muster is scheduling a time that's not patently absurd because I live in North America :) [14:09:47] <_joe_> uhm, I don't know much about what alex was doing there [14:10:06] <_joe_> but I guess I can catch-up if needed [14:10:26] (03CR) 10DCausse: [C: 031] "looks like it failed to merge?" 
[puppet] - 10https://gerrit.wikimedia.org/r/367709 (https://phabricator.wikimedia.org/T169498) (owner: 10EBernhardson) [14:10:32] (03CR) 10Andrew Bogott: [C: 032] labs puppetmaster: allow access to the enc api via ipv6 [puppet] - 10https://gerrit.wikimedia.org/r/368430 (owner: 10Andrew Bogott) [14:11:11] !log upgrading rhenium to stretch via dist-upgrade [14:11:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:47] Gist is, we have 9x2 new servers across eqiad and codfw. I think they should be fully puppetized. The plan is to hit them with a little stress testing utility that I made until they keel over. I've updated our dashboard so we can check that. [14:11:55] ores100*.eqiad.wmnet [14:12:21] _joe_, ^ [14:12:46] (03Abandoned) 10Cmjohnson: Adding dns entries for kafka-jumbo100[1-6] T167992 [dns] - 10https://gerrit.wikimedia.org/r/368186 (owner: 10Cmjohnson) [14:13:02] halfak: Stop forkbombing your own servers [14:13:09] <_joe_> rotfl [14:13:41] PROBLEM - DPKG on rhenium is CRITICAL: DPKG CRITICAL dpkg reports broken packages [14:14:12] :) Reedy [14:14:35] * halfak gets a task [14:14:36] <_joe_> halfak: so, those servers are still not part of any pool, so you'd need to point to them directly [14:14:43] https://phabricator.wikimedia.org/T169246 [14:14:52] <_joe_> so you can test a single server for now [14:15:00] Right. The stress tester will auto-roundrobin a set of hosts. [14:15:04] <_joe_> ok [14:15:49] So, the stress tester requires minimal resources. Should I run it from bast1001? [14:16:19] <_joe_> I would run it from one of the ores hosts themselves, tbh [14:16:35] When that host starts to die, it might affect the stress tester. [14:17:02] <_joe_> uhm, if the death of ores there affects the stress tester, that's a misconfiguration [14:17:10] fair point. [14:17:19] OK! 
[14:17:27] <_joe_> system should have resources set up so that any small utility can run while ores is under full load :) [14:17:32] I'll choose ores1001 then and get started. [14:17:36] <_joe_> ok [14:17:55] I'll need to run a quick minor test to make sure grafana is picking it up right. [14:17:56] <_joe_> let me know if you want me to take a look at things once you managed to send the service belly up :P [14:18:21] <_joe_> let's start with a couple hosts and then ramp up? [14:18:21] PROBLEM - Check whether ferm is active by checking the default input chain on rhenium is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly [14:18:34] _joe_, I'm sure there's means of monitoring what's going on that you know about and I don't. If you can look at them after the fact, I'll just let you know when I'm done. [14:18:53] 10Operations, 10Ops-Access-Requests: Requesting access to mwlog1001.eqiad.wmnet for goransm - https://phabricator.wikimedia.org/T171958#3481349 (10GoranSMilovanovic) [14:19:01] <_joe_> so that we don't risk flooding the mw api if we underestimated your benchmarking tool :P [14:19:35] _joe_, sure. Not sure how this would affect mw api, but OK! [14:20:44] 10Operations, 10Cloud-VPS, 10cloud-services-team (Kanban): instance root passwords vs. multiple puppetmasters - https://phabricator.wikimedia.org/T171959#3481362 (10Andrew) [14:20:52] OK. So problem pulling my utility to this machine because it can't talk to the outside. Sec. [14:21:10] <_joe_> halfak: isn't ores calling the mw api? [14:21:27] <_joe_> how is it fetching revisions/edits if not that way? :) [14:21:40] Oh! derp. Good point :) [14:21:51] I figured the api is way higher capacity :D [14:21:52] PROBLEM - salt-minion processes on rhenium is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [14:22:08] <_joe_> uhm [14:22:37] * halfak figures out how to move the necessary files around. 
[14:22:42] <_joe_> I'm not that sure everything is configured like I'd like it to be in these systems [14:23:09] <_joe_> all of them use a local redis instance, AFAICT [14:23:34] Oh... They should share a redis instance [14:23:36] * halfak digs [14:23:39] <_joe_> wait [14:23:45] <_joe_> I'm looking into it [14:24:06] <_joe_> they're definitely using a local redis instance [14:24:27] yup [14:24:29] abort :( [14:24:34] This isn't prod-like [14:24:35] <_joe_> no wait [14:24:38] So it won't work [14:24:46] <_joe_> they're all using a redis instance on ores1001 [14:24:51] <_joe_> at least the ones in eqiad [14:25:14] <_joe_> yup [14:25:17] <_joe_> sorry, brb [14:25:18] Ohhhh... That's not terrible [14:25:20] kk [14:25:35] This is prod-ish [14:25:39] unabort [14:25:50] but I'm not sure we're going to get a good and accurate understanding [14:27:40] <_joe_> the issue being [14:27:51] PROBLEM - puppet last run on rhenium is CRITICAL: Return code of 255 is out of bounds [14:27:52] <_joe_> I didn't inspect the config of that redis instance [14:27:57] <_joe_> I'm going to look now [14:29:02] PROBLEM - MD RAID on rhenium is CRITICAL: Return code of 255 is out of bounds [14:29:06] <_joe_> halfak: it should be ok, I'd suggest to leave ores1001 out of your tests for now [14:29:11] PROBLEM - configured eth on rhenium is CRITICAL: Return code of 255 is out of bounds [14:29:11] PROBLEM - Check systemd state on rhenium is CRITICAL: Return code of 255 is out of bounds [14:29:12] PROBLEM - Check size of conntrack table on rhenium is CRITICAL: Return code of 255 is out of bounds [14:29:15] <_joe_> the list of hosts I mean [14:29:21] PROBLEM - Disk space on rhenium is CRITICAL: Return code of 255 is out of bounds [14:29:22] PROBLEM - dhclient process on rhenium is CRITICAL: Return code of 255 is out of bounds [14:29:24] <_joe_> can someone check on rhenium please? 
[14:29:27] _joe_, OK will do [14:30:30] checking rhenium [14:30:32] _joe_ checking rhenium [14:30:41] herron: go ahead :) [14:30:58] Confirmed the stress tester can run [14:31:15] ok :D [14:31:24] <_joe_> eheh [14:31:36] <_joe_> halfak: ok, let me know which machines you're targeting [14:31:42] confirmed that grafana reports activity in the cluster [14:31:47] just ores1002/3 right now [14:31:55] On super duper light mode [14:32:40] <_joe_> cool [14:34:00] BTW, the celery queue will make sure that all of the machines get hit for CPU work. [14:34:03] rhenium is fine, Faidon upgraded it to stretch [14:34:09] It'll just distribute however it can [14:34:12] RECOVERY - configured eth on rhenium is OK: OK - interfaces up [14:34:19] s/upgrading/is upgrading/ [14:34:21] RECOVERY - Check size of conntrack table on rhenium is OK: OK: nf_conntrack is 0 % full [14:34:21] RECOVERY - Disk space on rhenium is OK: DISK OK [14:34:22] RECOVERY - dhclient process on rhenium is OK: PROCS OK: 0 processes with command name dhclient [14:34:26] But by hitting specific machines through http, we'll make sure all of the IO (mostly mw api) happens there [14:34:27] I !logged it too, see above [14:34:30] _joe_, ^ [14:34:31] RECOVERY - Check whether ferm is active by checking the default input chain on rhenium is OK: OK ferm input default policy is set [14:34:52] ema: ^ [14:35:02] and _joe_ [14:35:07] <_joe_> paravoid: heh I lost that in the shower of messages here [14:35:11] RECOVERY - MD RAID on rhenium is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0 [14:35:33] herron: rhenium is OK ^ [14:36:09] \o/ [14:36:19] * halfak updates some more grafana based on missing stats. [14:37:55] load on s2 seems higher than usual [14:39:24] main traffic, not api [14:40:39] waiting on getting a dataset of random revision IDs for the test... [14:40:41] RECOVERY - DPKG on rhenium is OK: All packages OK [14:40:48] Should just be 1-2 more minutes. 
[14:41:09] <_joe_> halfak: the sistems are unimpressed for now :P [14:41:15] <_joe_> *systems [14:41:52] RECOVERY - puppet last run on rhenium is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [14:42:12] RECOVERY - Check systemd state on rhenium is OK: OK - running: The system is fully operational [14:42:51] RECOVERY - salt-minion processes on rhenium is OK: PROCS OK: 3 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [14:44:28] (03PS1) 10Andrew Bogott: Only store instance root passwords on the frontend puppetmaster. [puppet] - 10https://gerrit.wikimedia.org/r/368433 (https://phabricator.wikimedia.org/T171959) [14:45:23] OK here we go. Just hitting ores1002/3 [14:46:55] (03CR) 10Andrew Bogott: [C: 032] Only store instance root passwords on the frontend puppetmaster. [puppet] - 10https://gerrit.wikimedia.org/r/368433 (https://phabricator.wikimedia.org/T171959) (owner: 10Andrew Bogott) [14:48:13] <_joe_> halfak: your tool is submitting requests without a revid [14:48:16] <_joe_> AFAICT [14:48:22] Gotcha. [14:48:22] <_joe_> /v3/scores/enwiki/?features=&revids= [14:48:23] Checking [14:48:39] Ahh yes. i can see that in the logging now. [14:49:12] Oh strange [14:49:21] <_joe_> andrewbogott: is storing those passwords a security issue? [14:49:21] PROBLEM - Check systemd state on rhenium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [14:49:22] (03PS1) 10Jcrespo: sanitarium3: Convert db1102 into a proper multi-instance host [puppet] - 10https://gerrit.wikimedia.org/r/368434 (https://phabricator.wikimedia.org/T169514) [14:49:36] <_joe_> I know nothing about that generator script [14:49:57] I know what happened. I'll need to file a quarry bug [14:49:58] <_joe_> but neither the old or the new guard work [14:50:01] but there was also human error ;) [14:50:15] <_joe_> the new one is particularly easy to spoof [14:50:17] _joe_: tell me about how they don't work? 
[14:50:30] (It's only a mild security issue since getting the console in the first place also requires prod access) [14:50:38] (03Abandoned) 10Jcrespo: mariadb: Switch db1102 role from sanitarium3->dbstore_multiinstance [puppet] - 10https://gerrit.wikimedia.org/r/363204 (https://phabricator.wikimedia.org/T169514) (owner: 10Jcrespo) [14:50:52] <_joe_> andrewbogott: if I get this correctly, you want to prevent someone from running that script from a self-hosted puppetmaster? [14:51:04] 10Operations, 10Analytics, 10netops, 10User-Elukey: Review ACLs for the Analytics VLAN - https://phabricator.wikimedia.org/T157435#3481411 (10elukey) [14:51:07] <_joe_> I'm not sure what's the context, what you're trying to guard against [14:51:21] RECOVERY - Check systemd state on rhenium is OK: OK - running: The system is fully operational [14:51:38] (03CR) 10Giuseppe Lavagetto: [C: 032] Add filters to the future parser [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/367678 (owner: 10Giuseppe Lavagetto) [14:51:51] _joe_: I want one and only one puppetmaster (the one running in prod) to have the store of all the passwords. So once an instance switches to a self-hosted puppetmaster it should stop generating and storing them [14:51:58] oh, hm... 
[14:52:02] OK attempting again [14:52:23] (03Merged) 10jenkins-bot: Add filters to the future parser [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/367678 (owner: 10Giuseppe Lavagetto) [14:52:28] _joe_: you're right, my new patch is dumb since it means that if the puppetmaster name is changed in hiera the test still passes [14:52:34] so I guess I have to hard-code it [14:52:34] <_joe_> yes [14:52:39] <_joe_> or if you spoof dns [14:53:00] <_joe_> but then people can change the code [14:53:01] <_joe_> :) [14:54:37] <_joe_> so my suggestion was to do something simpler even, without using ipresolve [14:55:00] (03PS1) 10Andrew Bogott: Another change to generation of root passwords [puppet] - 10https://gerrit.wikimedia.org/r/368435 (https://phabricator.wikimedia.org/T171959) [14:56:07] (03CR) 10jerkins-bot: [V: 04-1] Another change to generation of root passwords [puppet] - 10https://gerrit.wikimedia.org/r/368435 (https://phabricator.wikimedia.org/T171959) (owner: 10Andrew Bogott) [14:56:14] _joe_, I feel like this test has been running pretty well. It's at 1/10th of what I think capacity should be and only hitting two nodes directly. What do you think? [14:56:51] ~600 requests per minute. [14:57:10] I want to add the other nodes and try ~2000 requests per minute [14:57:13] <_joe_> servers are unimpressed generally [14:57:17] :) [14:57:23] <_joe_> I'd say go on [14:57:25] OK time for a real test! [14:57:36] * halfak starts taking notes and noting timestamps [14:58:09] wow. 3365*5 scores generated and no timeout errors :) [14:59:39] <_joe_> go on and try harder :P [15:00:00] Here we go! [15:00:18] (03PS2) 10Jcrespo: sanitarium3: Convert db1102 into a proper multi-instance host [puppet] - 10https://gerrit.wikimedia.org/r/368434 (https://phabricator.wikimedia.org/T169514) [15:02:33] Woops. 
Looks like we had a brief overload event [15:02:39] (03PS1) 10Giuseppe Lavagetto: puppet_compiler: upgrade to 0.2.2 [puppet] - 10https://gerrit.wikimedia.org/r/368437 [15:02:41] (03PS2) 10Andrew Bogott: Disable generation of root passwords for now [puppet] - 10https://gerrit.wikimedia.org/r/368435 (https://phabricator.wikimedia.org/T171289) [15:02:55] <_joe_> halfak: can you define "overload"? [15:02:59] Wait... wasn't an overload. [15:03:08] We set backpressures on our celery queue. [15:03:12] <_joe_> yeah not from the servers' perspective [15:03:16] When it gets too big, we start to return 503s [15:03:21] <_joe_> uhm [15:03:27] <_joe_> so we need to have more celery workers [15:03:33] right. [15:03:47] How's memory usage? [15:03:51] That's our ceiling for workers. [15:04:04] <_joe_> well, if you have a task open for this, add date/times (in UTC) [15:04:06] <_joe_> 65881080 total, 14889904 used, 50991176 free [15:04:11] <_joe_> tons of free memory [15:04:30] <_joe_> that's why I said we will need to tune it [15:04:45] (03CR) 10Andrew Bogott: [C: 032] Disable generation of root passwords for now [puppet] - 10https://gerrit.wikimedia.org/r/368435 (https://phabricator.wikimedia.org/T171289) (owner: 10Andrew Bogott) [15:04:53] Right. We should definitely up that worker count. How do you feel about making changes to these servers directly? [15:05:39] Confirmed that for a brief moment, 1003 returned a 503 because it thought the queue was too big. [15:05:55] <_joe_> halfak: at 5 pm on friday? [15:06:07] <_joe_> uhm lemme find the appropriate gif for that :P [15:06:08] _joe_, good point. probably don't want to do that. [15:06:13] lol [15:06:22] I also asked to wait for a wikidata deployment [15:06:26] in similar terms [15:06:39] <_joe_> jynus: this is out of production atm [15:06:43] ok [15:06:43] <_joe_> so it's less critical [15:06:50] _joe_, I'm OK with calling it right here so you can enjoy your evening. 
I feel like this was very useful already and I'll be more prepared to try again. [15:07:02] WOuld you be willing to work with me around the same time on Monday? [15:07:10] I could show up an hour earlier too without much pain. [15:07:17] <_joe_> still, if something goes wrong with whatever me and halfak are doing we're not burning down production [15:07:33] right. These servers aren't pooled [15:07:38] No external requests [15:07:42] <_joe_> halfak: heh monday is meetings day, but if you show up at about 14:00Z I have a couple hours [15:07:47] 10Operations, 10Cloud-VPS, 10cloud-services-team (Kanban): Switch to new labs puppetmasters - https://phabricator.wikimedia.org/T171786#3481437 (10Andrew) [15:07:52] How about 1300 UTC? [15:07:58] <_joe_> that's even better [15:07:59] <_joe_> :) [15:08:14] <_joe_> if you want to do more tests, please do [15:08:27] 10Operations, 10Cloud-VPS, 10cloud-services-team (Kanban): Switch to new labs puppetmasters - https://phabricator.wikimedia.org/T171786#3476152 (10Andrew) [15:08:32] {{done}}! [15:08:34] <_joe_> I have one thing to finish, then I might be able to look at the data [15:08:50] _joe_, OK! I'll hit it hard before I give up for the day. [15:08:52] <_joe_> what is the task for this load-testing? [15:09:01] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL: CRITICAL: 1.67% of data above the critical threshold [1000.0] [15:09:58] https://phabricator.wikimedia.org/T169246 [15:10:19] <_joe_> ok, I'll subscribe :) [15:10:53] 10Operations, 10ORES, 10Scoring-platform-team, 10Patch-For-Review, 10User-Joe: Stress/capacity test new ores* cluster - https://phabricator.wikimedia.org/T169246#3481458 (10Joe) [15:11:34] Triple the rate of request! [15:11:59] (03PS2) 10Giuseppe Lavagetto: puppet_compiler: upgrade to 0.2.2 [puppet] - 10https://gerrit.wikimedia.org/r/368437 [15:12:37] OK Definitely over capacity! [15:12:45] Cool! 
[15:13:16] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] puppet_compiler: upgrade to 0.2.2 [puppet] - 10https://gerrit.wikimedia.org/r/368437 (owner: 10Giuseppe Lavagetto) [15:16:01] PROBLEM - salt-minion processes on rhenium is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [15:17:12] Trying out batches of 50 [15:17:43] Still getting an overload. I'm surprised. [15:17:57] I suppose the bottleneck continues to be celery. [15:18:05] And batching only affects IO (uwsgi workers) [15:18:25] * halfak talks to himself. [15:18:27] But it helps [15:22:48] (03CR) 10Gehel: "yep, waiting Monday to merge and deploy..." [puppet] - 10https://gerrit.wikimedia.org/r/367709 (https://phabricator.wikimedia.org/T169498) (owner: 10EBernhardson) [15:23:30] !log upgrading and restarting db1102 [15:23:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:23:49] <_joe_> halfak: yes, we should raise the number of celery workers [15:23:50] (03PS3) 10Jcrespo: sanitarium3: Convert db1102 into a proper multi-instance host [puppet] - 10https://gerrit.wikimedia.org/r/368434 (https://phabricator.wikimedia.org/T169514) [15:24:03] <_joe_> I'll check our metrics [15:24:40] +1 [15:25:39] 10Operations, 10ORES, 10Scoring-platform-team, 10Patch-For-Review, 10User-Joe: Stress/capacity test new ores* cluster - https://phabricator.wikimedia.org/T169246#3481494 (10Halfak) I've completed a few tests. TL;DR: we need to up our celery worker count before we'll get an accurate reflection of the cap... [15:25:43] Just about to leave a note with my tests on the phab card. 
[15:25:43] 10Operations, 10fundraising-tech-ops, 10netops: bonded/redundant network connections for fundraising hosts - https://phabricator.wikimedia.org/T171962#3481495 (10Jgreen) [15:25:45] (03CR) 10Jcrespo: [C: 032] sanitarium3: Convert db1102 into a proper multi-instance host [puppet] - 10https://gerrit.wikimedia.org/r/368434 (https://phabricator.wikimedia.org/T169514) (owner: 10Jcrespo) [15:27:10] 10Operations, 10fundraising-tech-ops, 10netops: bonded/redundant network connections for fundraising hosts - https://phabricator.wikimedia.org/T171962#3481515 (10Jgreen) [15:27:13] 10Operations, 10ops-eqiad, 10netops: eqiad: rack frack refresh equipment - https://phabricator.wikimedia.org/T169644#3481514 (10Jgreen) [15:31:08] 10Operations, 10fundraising-tech-ops, 10netops: bonded/redundant network connections for fundraising hosts - https://phabricator.wikimedia.org/T171962#3481522 (10ayounsi) We have plenty of ports on the new switches to accommodate that. My suggestion is that we do it after the migration to the new infra (and... [15:37:06] (03PS1) 10Giuseppe Lavagetto: Do not filter catalogs if they have not compiled. [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/368438 [15:37:11] RECOVERY - salt-minion processes on rhenium is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [15:38:48] Sorry, still having performance issue [15:38:51] In Europe [15:38:52] (03PS2) 10Giuseppe Lavagetto: Do not filter catalogs if they have not compiled. [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/368438 [15:39:01] It's taking over 2 mins to load some pages [15:39:10] <_joe_> ShakespeareFan00: care to make an example? 
[15:39:10] This is NOT acceptable [15:39:25] https://en.wikisource.org/wiki/Page:The_Cutter%27s_Practical_Guide_Part_13.djvu/65 [15:39:29] Took over 2 mins to loaf [15:39:30] *load [15:39:35] <_joe_> this is a djvu single page [15:39:44] or didn't even finish loading [15:39:49] _joe_ : that correct [15:39:56] <_joe_> that loaded in 108 ms for me [15:40:04] Not for me [15:40:09] <_joe_> can you try now? [15:40:42] I can but it's been consistently poor since this morning [15:40:55] <_joe_> so can I have another example? [15:41:00] <_joe_> one that is currently slow [15:41:10] <_joe_> or I cannot really help trying to investigate [15:41:21] PROBLEM - puppet last run on rhenium is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 2 minutes ago with 2 failures. Failed resources (up to 3 shown): Package[debdeploy-minion],Package[quickstack] [15:41:25] <_joe_> Is this only happening with wikisource and djvu? [15:41:25] https://en.wikisource.org/wiki/Page:Dictionary_of_National_Biography_volume_26.djvu/16 [15:41:32] #https://en.wikisource.org/wiki/Page:The_Cutter%27s_Practical_Guide_Part_13.djvu/65 [15:41:43] _joe_ : I haven't looked at other wikis yet [15:41:47] <_joe_> loaded both in ~ 110 ms [15:42:00] <_joe_> ShakespeareFan00: that seems like a problem with multi-page djvu files [15:42:01] But yeah [15:42:06] <_joe_> but I'm not sure [15:42:15] having problems with the main page of en.wikipedia.org just now [15:42:26] Images aren't loading - https://en.wikipedia.org/wiki/Main_Page [15:42:33] <_joe_> ok then it's definitely your connection [15:42:35] https://en.wikipedia.org/wiki/Main_Page loads fine for me using bt. [15:43:17] Puzzling [15:43:30] Because I am not having issues with other websites [15:43:33] <_joe_> ShakespeareFan00: can you tell me your IP in private? [15:43:46] IP or iSP? 
[15:43:50] 10Operations, 10fundraising-tech-ops, 10netops: bonded/redundant network connections for fundraising hosts - https://phabricator.wikimedia.org/T171962#3481550 (10RobH) >>! In T171962#3481522, @ayounsi wrote: > We have plenty of ports on the new switches to accommodate that. My suggestion is that we do it aft... [15:43:51] <_joe_> both :P [15:43:56] <_joe_> ideally [15:44:03] ISP is Voadfone [15:44:08] (ex Demon/Thus) [15:44:19] <_joe_> and a traceroute to en.wikipedia.org [15:44:22] As to the IP I'm not sure as i think it's a pool address which is dynamic [15:44:59] <_joe_> ShakespeareFan00: literally write "whats my ip" in google :) [15:45:12] <_joe_> and share it in private [15:45:15] Tracert isn't givign sensible results [15:46:18] ShakespeareFan00 restart the router :). [15:46:34] Have you checked your local exchange for any maint? [15:46:45] paladox: I haven't [15:46:53] So it could be that [15:46:58] hi [15:47:28] if you can also give the output of a speedtest.net [15:47:46] Try https://www.homeandwork.openreach.co.uk/help-and-support/local-network-status-checker.aspx [15:47:53] No local maintainence that I can obviously find [15:49:13] (03CR) 10Giuseppe Lavagetto: [C: 032] Do not filter catalogs if they have not compiled. [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/368438 (owner: 10Giuseppe Lavagetto) [15:51:19] ShakespeareFan00: https://support.vodafone.co.uk/Vodafone-products-and-services/Broadband/Vodafone-broadband-router/57355761/My-broadband-is-slow-How-can-I-make-it-faster.htm [15:52:10] paladox: usual helpdesk advice I already know and follow [15:52:12] ;) [15:52:19] (And completly useless.) [15:52:20] oh :) [15:53:12] ShakespeareFan00 check your phone line by picking the phone up to see if there's any noise. The lines are known to be slow if you have a phone line fault. [15:53:28] paladox: It's a dedicated line [15:53:32] XD [15:53:38] It shouldn't be "noisy" [15:53:41] what do you mean dedicated? 
[15:53:54] if there's a line fault there will be noise. [15:54:06] ShakespeareFan00: also https://wikitech.wikimedia.org/wiki/Reporting_a_connectivity_issue [15:58:12] ShakespeareFan00 here's some bt things not sure if the bt check your line will work for you but here it is https://www.bt.com/help/home/broadband/speedtest/ (but i hope some of these will work for you as vodaphone uses bts lines) [15:58:16] https://www.bt.com/consumerFaultTracking/public/faults/tracking.do?pageId=31 [15:58:52] ShakespeareFan00 is any of the areas above your area? [15:59:22] Nope [15:59:38] Although problems in London Thamesmead might affect UK connectvity with the rest of the wrold [15:59:44] because of where LINX is [16:00:18] (03PS1) 10Giuseppe Lavagetto: puppet-compiler: bump it up again [puppet] - 10https://gerrit.wikimedia.org/r/368443 [16:01:00] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] puppet-compiler: bump it up again [puppet] - 10https://gerrit.wikimedia.org/r/368443 (owner: 10Giuseppe Lavagetto) [16:01:04] ShakespeareFan00 but then it would affect me too [16:01:07] which it dosen't [16:01:23] try restarting the router. It could be the big green box outside your house. [16:02:35] ShakespeareFan00 do you have fttc, fttp or adsl broaband? [16:02:46] ADSL broadband [16:02:50] As far as I know [16:05:02] that would explain somethings. [16:06:35] 10Operations, 10Puppet, 10Traffic, 10Mobile, and 2 others: URLs with title query string parameter and additional query string parameters do not redirect to mobile site - https://phabricator.wikimedia.org/T154227#3481621 (10Jdlrobson) >>! In T154227#3480505, @Nemo_bis wrote: > Can you guarantee to support a... [16:08:07] ShakespeareFan00 could you run what XioNoX has asked please :) [16:08:31] I don't have curl [16:08:35] (Not a linux user) [16:08:43] ShakespeareFan00 install git [16:08:49] php has curl [16:08:56] Well .... 
[16:09:03] I am reticent to install anything [16:09:04] speedtest and traceroutes would be a good start [16:09:12] I did a tracert [16:09:19] It didn't show anything obvious [16:09:21] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL: CRITICAL: 1.67% of data above the critical threshold [1000.0] [16:11:17] ShakespeareFan00 could it be possible vodaphone is throtlling wikipedia? Though bt and other big providers have signed a document saying they wont do that but i am not aware vodaphone did. [16:12:18] ShakespeareFan00 http://speedtest.net could you run that please? [16:13:18] paladox: If they are throttlling it wouldn't suprise me [16:13:50] that would then be the only provider that does [16:14:14] (03PS4) 10Jcrespo: sanitarium3: Convert db1102 into a proper multi-instance host [puppet] - 10https://gerrit.wikimedia.org/r/368434 (https://phabricator.wikimedia.org/T169514) [16:14:45] ShakespeareFan00 does vodaphone do support through twitter? If so they may be able to get an engineer to look into that for you :). [16:15:17] last time I dealt with Voadphone support it was hard to get them to acknoweldge there was an issue [16:16:05] I have a thiery on why you have slow broadband but i have not used adsl in a long while (too slow). It's possible that the thing in the green box is detecting a fault on the line so it increases latency and lower speeds. [16:16:54] oh wait, your not with green box (my mistake) your connected directly to the exchange because you use adsl [16:18:12] (03CR) 10GWicke: [C: 031] JobQueueEventBus: Enable job events in group0 wikis. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/368258 (https://phabricator.wikimedia.org/T163380) (owner: 10Ppchelko) [16:18:51] !log apt-get install apache2 on californium for security updates [16:19:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:19:26] paladox:http://www.speedtest.net/my-result/6493425327 [16:19:37] !log apt-get install apache2 on silver for security updates [16:19:40] PING 390 ms [16:19:43] 10Operations, 10Pybal, 10Traffic, 10Patch-For-Review: Add support for setting weight=0 when depooling - https://phabricator.wikimedia.org/T86650#3481655 (10ema) p:05Low>03High [16:19:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:19:50] that's very high [16:20:04] mine was in the 00 when i had adsl [16:20:09] But the download speed is 9.mbps which is excellent for the UK [16:20:10] 00 -> tens [16:20:31] that's excellent for adsl but i get 60+ on fibre [16:21:22] Your internet is provided by http://demon.net [16:21:37] !log apt-get install apache2 on labcontrol1001 and labcontrol1002 for security updates [16:21:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:23:06] 10Operations, 10ops-ulsfo, 10Traffic: setup/install cp402[34] - https://phabricator.wikimedia.org/T171966#3481656 (10RobH) [16:23:34] (03PS1) 10Reception123: Add new mobile watermark for Urdu Wikipedia. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/368444 (https://phabricator.wikimedia.org/T171769) [16:24:21] RECOVERY - carbon-cache too many creates on graphite1001 is OK: OK: Less than 1.00% above the threshold [500.0] [16:25:57] 10Operations, 10ops-ulsfo, 10Traffic: setup/install cp4022 - https://phabricator.wikimedia.org/T171967#3481679 (10RobH) [16:30:17] (03PS1) 10Cmjohnson: Adding dns entries (mgmt and production) for labstore1006/7 public vlan T167984 [dns] - 10https://gerrit.wikimedia.org/r/368445 [16:30:27] (03CR) 10Reception123: [C: 031] Make wikiquote.png equivalent to enwikiquote.png [mediawiki-config] - 10https://gerrit.wikimedia.org/r/368244 (https://phabricator.wikimedia.org/T171887) (owner: 10Urbanecm) [16:32:45] (03PS1) 10RobH: install params for cp402[34].ulsfo.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/368447 (https://phabricator.wikimedia.org/T171966) [16:33:16] 10Operations, 10ops-ulsfo, 10Traffic, 10Patch-For-Review: setup/install cp402[34] - https://phabricator.wikimedia.org/T171966#3481733 (10RobH) [16:34:40] (03CR) 10RobH: [C: 032] install params for cp402[34].ulsfo.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/368447 (https://phabricator.wikimedia.org/T171966) (owner: 10RobH) [16:38:01] RECOVERY - puppet last run on rhenium is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [16:41:05] (03CR) 10محمد شعیب: [C: 031] Add new mobile watermark for Urdu Wikipedia. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/368444 (https://phabricator.wikimedia.org/T171769) (owner: 10Reception123) [16:41:38] (03PS5) 10Jcrespo: sanitarium3: Convert db1102 into a proper multi-instance host [puppet] - 10https://gerrit.wikimedia.org/r/368434 (https://phabricator.wikimedia.org/T169514) [16:46:54] (03PS1) 10Andrew Bogott: labs puppetmaster: validate cert name before autosigning [puppet] - 10https://gerrit.wikimedia.org/r/368449 (https://phabricator.wikimedia.org/T171289) [16:51:05] 10Operations, 10fundraising-tech-ops, 10netops: Move codfw frack to new infra - https://phabricator.wikimedia.org/T171970#3481776 (10ayounsi) [16:52:36] (03PS1) 10Jcrespo: mariadb-multiinstance: Add missing basedir to config [puppet] - 10https://gerrit.wikimedia.org/r/368450 (https://phabricator.wikimedia.org/T169514) [16:54:29] (03CR) 10Jcrespo: [C: 032] mariadb-multiinstance: Add missing basedir to config [puppet] - 10https://gerrit.wikimedia.org/r/368450 (https://phabricator.wikimedia.org/T169514) (owner: 10Jcrespo) [16:56:55] (03PS2) 10Andrew Bogott: labs puppetmaster: validate cert name before autosigning [puppet] - 10https://gerrit.wikimedia.org/r/368449 (https://phabricator.wikimedia.org/T171961) [16:57:20] (03PS1) 10RobH: fixing cp402[34] mac addresses [puppet] - 10https://gerrit.wikimedia.org/r/368453 (https://phabricator.wikimedia.org/T171966) [16:57:48] (03PS2) 10RobH: fixing cp402[34] mac addresses [puppet] - 10https://gerrit.wikimedia.org/r/368453 (https://phabricator.wikimedia.org/T171966) [16:59:29] (03CR) 10RobH: [C: 032] fixing cp402[34] mac addresses [puppet] - 10https://gerrit.wikimedia.org/r/368453 (https://phabricator.wikimedia.org/T171966) (owner: 10RobH) [17:03:41] PROBLEM - Check systemd state on db1102 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[17:07:42] RECOVERY - Check systemd state on db1102 is OK: OK - running: The system is fully operational [17:08:58] (03PS1) 10Andrew Bogott: fullstack: Switch back to the normal schedule pool [puppet] - 10https://gerrit.wikimedia.org/r/368454 [17:09:00] (03PS1) 10Andrew Bogott: nova: add labvirt1016 to the scheduling pool [puppet] - 10https://gerrit.wikimedia.org/r/368455 [17:10:07] RECOVERY - mysqld processes on db1102 is OK: PROCS OK: 3 processes with command name mysqld [17:10:07] 10Operations, 10Wikimedia-log-errors: mw1209 /usr/bin/timeout: the monitored command dumped core - https://phabricator.wikimedia.org/T171903#3481878 (10thcipriani) >>! In T171903#3481698, @hashar wrote: > Without knowing the command passed to it, I am not sure how to track the root cause of that. > > ulimit `... [17:11:44] recovery page w/o the down page [17:11:48] huh [17:11:55] (03CR) 10Andrew Bogott: [C: 032] fullstack: Switch back to the normal schedule pool [puppet] - 10https://gerrit.wikimedia.org/r/368454 (owner: 10Andrew Bogott) [17:12:59] apergos: see my comment on the other channel- that was broken for a long time [17:13:12] it took me 3 days to fix it [17:15:42] PROBLEM - MariaDB Slave IO: s2 on db1102 is CRITICAL: CRITICAL slave_io_state could not connect [17:16:22] (03CR) 10Andrew Bogott: [C: 032] nova: add labvirt1016 to the scheduling pool [puppet] - 10https://gerrit.wikimedia.org/r/368455 (owner: 10Andrew Bogott) [17:16:41] PROBLEM - MariaDB Slave IO: s6 on db1102 is CRITICAL: CRITICAL slave_io_state could not connect [17:17:14] ok! 
[17:17:31] PROBLEM - MariaDB Slave IO: s7 on db1102 is CRITICAL: CRITICAL slave_io_state could not connect [17:18:59] !log cleaned up core files in mw1209:/var/tmp/core to clear disk alert [17:19:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:21:01] PROBLEM - MariaDB Slave SQL: s2 on db1102 is CRITICAL: CRITICAL slave_sql_state could not connect [17:22:01] PROBLEM - MariaDB Slave SQL: s6 on db1102 is CRITICAL: CRITICAL slave_sql_state could not connect [17:22:51] PROBLEM - MariaDB Slave SQL: s7 on db1102 is CRITICAL: CRITICAL slave_sql_state could not connect [17:25:06] 10Operations, 10Wikimedia-log-errors: mw1209 /usr/bin/timeout: the monitored command dumped core - https://phabricator.wikimedia.org/T171903#3481957 (10herron) @Joe and I were just looking at this because icinga had fired a disk alert. The 512M /var/cache/hhvm/cli.hhbc.sq3 file has been removed, and /var/tm... [17:25:22] PROBLEM - MariaDB Slave Lag: s2 on db1102 is CRITICAL: CRITICAL slave_sql_lag could not connect [17:26:21] PROBLEM - MariaDB Slave Lag: s6 on db1102 is CRITICAL: CRITICAL slave_sql_lag could not connect [17:27:11] PROBLEM - MariaDB Slave Lag: s7 on db1102 is CRITICAL: CRITICAL slave_sql_lag could not connect [17:30:07] 10Operations, 10Wikimedia-log-errors: mw1209 /usr/bin/timeout: the monitored command dumped core - https://phabricator.wikimedia.org/T171903#3479715 (10MoritzMuehlenhoff) The /var/cache/hhvm/cli.hhbc.sq3 caches were cleared when I upgraded to 3.18, I doubt any of those grew to 512 again. I also created https:... 
[17:31:54] (03PS1) 10Jcrespo: sanitarium_multiinstance: Enable binlog [puppet] - 10https://gerrit.wikimedia.org/r/368458 (https://phabricator.wikimedia.org/T169514) [17:32:33] that is me, downtime expired- things took more than I expected [17:35:38] (03CR) 10Jcrespo: [C: 032] sanitarium_multiinstance: Enable binlog [puppet] - 10https://gerrit.wikimedia.org/r/368458 (https://phabricator.wikimedia.org/T169514) (owner: 10Jcrespo) [17:36:07] (03PS2) 10Jdlrobson: Add new mobile watermark for Urdu Wikipedia. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/368444 (https://phabricator.wikimedia.org/T171769) (owner: 10Reception123) [17:36:09] (03CR) 10Jdlrobson: "PS2 compresses the SVG:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/368444 (https://phabricator.wikimedia.org/T171769) (owner: 10Reception123) [17:37:29] (03CR) 10Reception123: [C: 031] Add new mobile watermark for Urdu Wikipedia. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/368444 (https://phabricator.wikimedia.org/T171769) (owner: 10Reception123) [17:37:32] (03CR) 10Jdlrobson: Add new mobile watermark for Urdu Wikipedia. (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/368444 (https://phabricator.wikimedia.org/T171769) (owner: 10Reception123) [17:37:55] (03PS1) 10Jforrester: [DNM] ContInt: Upgrade npm from 2.15.2 to 3.8.3 in CI [puppet] - 10https://gerrit.wikimedia.org/r/368459 (https://phabricator.wikimedia.org/T161861) [17:38:15] (03CR) 10Jdlrobson: [C: 04-1] "Logo is incorrect. @nirzar will provide a more suitable one." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/368198 (owner: 10Reception123) [17:39:01] PROBLEM - Host rhenium is DOWN: PING CRITICAL - Packet loss = 100% [17:39:16] (03CR) 10Jforrester: "Not sure if we should go to v3.10.10 (latest release in 3.x); this is the node 6.0 version." 
[puppet] - 10https://gerrit.wikimedia.org/r/368459 (https://phabricator.wikimedia.org/T161861) (owner: 10Jforrester) [17:40:31] RECOVERY - Host rhenium is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms [17:42:36] (03CR) 10Reception123: "Ok, this is the one that I was given. If you can please also rebase this as I'm not sure why I can't." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/368198 (owner: 10Reception123) [17:51:38] (03CR) 10Ladsgroup: [C: 031] "The rebuild is complete now: https://phabricator.wikimedia.org/T171461#3469736 I will get this deployed in SWAT in monday." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/367393 (https://phabricator.wikimedia.org/T165197) (owner: 10Ladsgroup) [17:54:52] (03PS1) 10Ottomata: Install virtualenv bin on stat boxes [puppet] - 10https://gerrit.wikimedia.org/r/368461 (https://phabricator.wikimedia.org/T152712) [17:55:12] (03CR) 10Ottomata: [V: 032 C: 032] Install virtualenv bin on stat boxes [puppet] - 10https://gerrit.wikimedia.org/r/368461 (https://phabricator.wikimedia.org/T152712) (owner: 10Ottomata) [18:02:01] RECOVERY - MariaDB Slave IO: s2 on db1102 is OK: OK slave_io_state Slave_IO_Running: Yes [18:02:08] (03PS1) 10Rush: openstack: move openstack::repo to new model [puppet] - 10https://gerrit.wikimedia.org/r/368464 [18:02:11] RECOVERY - MariaDB Slave SQL: s6 on db1102 is OK: OK slave_sql_state Slave_SQL_Running: Yes [18:02:12] RECOVERY - MariaDB Slave SQL: s2 on db1102 is OK: OK slave_sql_state Slave_SQL_Running: Yes [18:02:21] RECOVERY - MariaDB Slave Lag: s7 on db1102 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [18:02:31] RECOVERY - MariaDB Slave Lag: s6 on db1102 is OK: OK slave_sql_lag Replication lag: 0.23 seconds [18:02:41] RECOVERY - MariaDB Slave Lag: s2 on db1102 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [18:02:42] RECOVERY - MariaDB Slave IO: s7 on db1102 is OK: OK slave_io_state Slave_IO_Running: Yes [18:02:44] finally fixed [18:02:52] RECOVERY - MariaDB Slave IO: s6 on 
db1102 is OK: OK slave_io_state Slave_IO_Running: Yes [18:03:01] RECOVERY - MariaDB Slave SQL: s7 on db1102 is OK: OK slave_sql_state Slave_SQL_Running: Yes [18:03:03] (03PS2) 10Rush: openstack: move openstack::repo to new model [puppet] - 10https://gerrit.wikimedia.org/r/368464 [18:05:53] (03CR) 10Bearloga: "@Otto I realize you're super busy with stat1002 stuff but also we need this patch because we're coming up to being a week behind on our me" [puppet] - 10https://gerrit.wikimedia.org/r/367930 (https://phabricator.wikimedia.org/T170494) (owner: 10Bearloga) [18:08:00] (03CR) 10Ottomata: [C: 031] "Seems totally fine to me! Thanks so much! If you don't mind, I'll let Gehel merge; (I'm not officially working today :) ), otherwise I c" [puppet] - 10https://gerrit.wikimedia.org/r/367930 (https://phabricator.wikimedia.org/T170494) (owner: 10Bearloga) [18:10:44] (03PS1) 10Rush: openstack: move openstack::repo to new model [puppet] - 10https://gerrit.wikimedia.org/r/368466 [18:11:00] (03PS2) 10Rush: openstack: move openstack::repo to new model [puppet] - 10https://gerrit.wikimedia.org/r/368466 [18:11:24] (03Abandoned) 10Rush: openstack: move openstack::repo to new model [puppet] - 10https://gerrit.wikimedia.org/r/368464 (owner: 10Rush) [18:14:08] (03PS3) 10Rush: openstack: move openstack::repo to new model [puppet] - 10https://gerrit.wikimedia.org/r/368466 (https://phabricator.wikimedia.org/T171494) [18:14:16] (03PS4) 10Rush: openstack: move openstack::repo to new model [puppet] - 10https://gerrit.wikimedia.org/r/368466 (https://phabricator.wikimedia.org/T171494) [18:15:06] (03CR) 10jerkins-bot: [V: 04-1] openstack: move openstack::repo to new model [puppet] - 10https://gerrit.wikimedia.org/r/368466 (https://phabricator.wikimedia.org/T171494) (owner: 10Rush) [18:16:21] PROBLEM - Router interfaces on pfw-eqiad is CRITICAL: CRITICAL: host 208.80.154.218, interfaces up: 104, down: 1, dormant: 0, excluded: 3, unused: 0 [18:16:21] PROBLEM - Router interfaces on cr2-codfw is 
CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 122, down: 1, dormant: 0, excluded: 0, unused: 0 [18:16:31] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 122, down: 1, dormant: 0, excluded: 0, unused: 0 [18:17:32] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 124, down: 0, dormant: 0, excluded: 0, unused: 0 [18:17:40] please ignore the above ^ [18:18:21] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 124, down: 0, dormant: 0, excluded: 0, unused: 0 [18:19:15] (03CR) 10Andrew Bogott: "If the puppet compiler is happy then I'm happy." (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/368466 (https://phabricator.wikimedia.org/T171494) (owner: 10Rush) [18:19:49] (03PS5) 10Rush: openstack: move openstack::repo to new model [puppet] - 10https://gerrit.wikimedia.org/r/368466 (https://phabricator.wikimedia.org/T171494) [18:20:22] RECOVERY - Router interfaces on pfw-eqiad is OK: OK: host 208.80.154.218, interfaces up: 105, down: 0, dormant: 0, excluded: 3, unused: 0 [18:20:24] (03PS6) 10Rush: openstack: move openstack::repo to new model [puppet] - 10https://gerrit.wikimedia.org/r/368466 (https://phabricator.wikimedia.org/T171494) [18:20:53] PROBLEM - puppet last run on stat1003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[virtualenv] [18:23:24] 10Operations, 10ORES, 10Scoring-platform-team, 10Patch-For-Review, 10User-Joe: Stress/capacity test new ores* cluster - https://phabricator.wikimedia.org/T169246#3482118 (10Halfak) I scheduled some time with @joe for running another test with more workers on Monday at 1300 UTC. 
[18:33:02] !log disabling puppet for labs things for trying out refactor rollout [18:33:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:36:20] (03CR) 10Rush: [C: 032] openstack: move openstack::repo to new model [puppet] - 10https://gerrit.wikimedia.org/r/368466 (https://phabricator.wikimedia.org/T171494) (owner: 10Rush) [18:37:53] 10Operations, 10Wikimedia-log-errors: mw1209 /usr/bin/timeout: the monitored command dumped core - https://phabricator.wikimedia.org/T171903#3482149 (10thcipriani) 05Open>03Resolved a:03herron >>! In T171903#3481957, @herron wrote: > @Joe and I were just looking at this because icinga had fired a disk al... [18:38:55] 10Operations, 10Epic, 10Goal, 10Services (later): End of August milestone: Cassandra 3 cluster in production - https://phabricator.wikimedia.org/T169939#3482164 (10Eevans) Regarding space in the cluster: [[ https://grafana.wikimedia.org/dashboard/db/restbase-cassandra-storage?orgId=1 | The dashboard ]] wo... [18:39:30] 10Operations, 10Epic, 10Goal, 10Services (later): End of August milestone: Cassandra 3 cluster in production - https://phabricator.wikimedia.org/T169939#3482165 (10Eevans) [18:48:10] !log enable and force puppet on labtestservices2001,labtestvirt2001,labtestcontrol2001,labservices1002,labcontrol1002,labnet1002,labvirt1014 and labtestneutron2001 to see a newly installed host get the change instead of a noop [18:48:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:06:04] 10Operations, 10Traffic: setup/install cp402[34] - https://phabricator.wikimedia.org/T171966#3482223 (10RobH) a:05RobH>03None [19:06:35] 10Operations, 10Traffic: setup/install cp402[34] - https://phabricator.wikimedia.org/T171966#3481656 (10RobH) These two systems are installed and calling into puppet, ready for service implementation. Assigning to @ayounsi but not sure if this should be him or @bblack. 
[19:09:08] 10Operations, 10ops-ulsfo, 10Traffic: setup/install cp4022 - https://phabricator.wikimedia.org/T171967#3482230 (10RobH) 05Open>03stalled [19:09:10] 10Operations, 10ops-ulsfo, 10Traffic, 10Patch-For-Review: replace ulsfo aging servers - https://phabricator.wikimedia.org/T164327#3482231 (10RobH) [19:14:31] PROBLEM - puppet last run on labtestweb2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:16:53] ^ looking [19:17:31] RECOVERY - puppet last run on labtestweb2001 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [19:19:22] 10Operations, 10ArchCom-RfC, 10Traffic, 10Services (designing): Make API usage limits easier to understand, implement, and more adaptive to varying request costs / concurrency limiting - https://phabricator.wikimedia.org/T167906#3482263 (10GWicke) [19:20:32] 10Operations, 10ArchCom-RfC, 10Traffic, 10Services (designing): Make API usage limits easier to understand, implement, and more adaptive to varying request costs / concurrency limiting - https://phabricator.wikimedia.org/T167906#3349120 (10GWicke) [19:26:06] 10Operations, 10ArchCom-RfC, 10Traffic, 10Services (designing): Make API usage limits easier to understand, implement, and more adaptive to varying request costs / concurrency limiting - https://phabricator.wikimedia.org/T167906#3482294 (10GWicke) [19:26:10] 10Operations, 10Traffic: setup/install cp402[34] - https://phabricator.wikimedia.org/T171966#3482295 (10ayounsi) a:03BBlack [19:32:43] (03PS1) 10Thcipriani: Jobrunner: create dsh groups per datacenter [puppet] - 10https://gerrit.wikimedia.org/r/368476 (https://phabricator.wikimedia.org/T129148) [19:37:27] 10Operations, 10Epic, 10Goal, 10Services (doing), and 2 others: Services Q1 2017/18 goal: Begin migrating job queue processing to multi-DC enabled eventbus infrastructure. 
- https://phabricator.wikimedia.org/T169937#3482343 (10Pchelolo) [19:40:15] !log releases2001 - OS install worked this time, could not reproduce grub error, signing puppet cert, initial puppet run (T171917) [19:40:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:40:27] T171917: setup releases2001.codfw.wmnet - https://phabricator.wikimedia.org/T171917 [19:47:52] (03PS1) 10Dzahn: releases: add releases2001 to site, change rsync direction [puppet] - 10https://gerrit.wikimedia.org/r/368477 (https://phabricator.wikimedia.org/T171917) [19:50:29] (03CR) 10Dzahn: [C: 032] releases: add releases2001 to site, change rsync direction [puppet] - 10https://gerrit.wikimedia.org/r/368477 (https://phabricator.wikimedia.org/T171917) (owner: 10Dzahn) [19:50:31] (03CR) 10Paladox: [C: 031] releases: add releases2001 to site, change rsync direction [puppet] - 10https://gerrit.wikimedia.org/r/368477 (https://phabricator.wikimedia.org/T171917) (owner: 10Dzahn) [19:58:55] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 232, down: 1, dormant: 0, excluded: 0, unused: 0 [19:58:55] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 122, down: 1, dormant: 0, excluded: 0, unused: 0 [20:01:58] (03CR) 10MarcoAurelio: Initial configuration for hiwikiversity (033 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/368165 (https://phabricator.wikimedia.org/T168765) (owner: 10Urbanecm) [20:02:55] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 234, down: 0, dormant: 0, excluded: 0, unused: 0 [20:02:56] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 124, down: 0, dormant: 0, excluded: 0, unused: 0 [20:06:18] (03CR) 10MarcoAurelio: Initial configuration for hiwikiversity (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/368165 
(https://phabricator.wikimedia.org/T168765) (owner: Urbanecm)
[20:08:47] (CR) Thcipriani: "I described the use-case for this patch in https://phabricator.wikimedia.org/T129148#3482379 but I'm not sure if there's an easier way to " [puppet] - https://gerrit.wikimedia.org/r/368476 (https://phabricator.wikimedia.org/T129148) (owner: Thcipriani)
[20:14:36] PROBLEM - Check systemd state on releases2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:15:26] PROBLEM - Check the NTP synchronisation status of timesyncd on releases2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:16:27] PROBLEM - DPKG on releases2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:17:16] PROBLEM - Disk space on releases2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:18:56] PROBLEM - configured eth on releases2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:19:13] that's new but still no reason to do that...
[20:19:56] PROBLEM - dhclient process on releases2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:20:28] how do I log a msg again?
[20:20:42] ~dumb questions~
[20:20:42] you start the line with !log
[20:20:46] PROBLEM - puppet last run on releases2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:20:59] !log removing 2FA from User:SPoore (WMF)
[20:21:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:21:11] thanks mutante
[20:21:14] yw
[20:21:38] PROBLEM - salt-minion processes on releases2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:22:16] RECOVERY - Disk space on releases2001 is OK: DISK OK
[20:22:17] RECOVERY - DPKG on releases2001 is OK: All packages OK
[20:22:19] (CR) MarcoAurelio: "Looks good so far. Thank you."
[mediawiki-config] - https://gerrit.wikimedia.org/r/368168 (https://phabricator.wikimedia.org/T155038) (owner: Urbanecm)
[20:22:27] RECOVERY - salt-minion processes on releases2001 is OK: PROCS OK: 3 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[20:22:30] ah, because IPv6 address still had to be added by puppet
[20:22:46] RECOVERY - dhclient process on releases2001 is OK: PROCS OK: 0 processes with command name dhclient
[20:22:56] RECOVERY - configured eth on releases2001 is OK: OK - interfaces up
[20:23:36] RECOVERY - puppet last run on releases2001 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures
[20:24:26] RECOVERY - Check systemd state on releases2001 is OK: OK - running: The system is fully operational
[20:32:38] MatmaRex: https://gerrit.wikimedia.org/r/#/c/368487/
[20:34:06] (PS4) Dzahn: releases: rsync reprepro data, set active server in Hiera [puppet] - https://gerrit.wikimedia.org/r/368333 (https://phabricator.wikimedia.org/T164030)
[20:35:22] (CR) Dzahn: "modified so that we only have hiera lookup in parameter of profile classes, nothing like that in role class" [puppet] - https://gerrit.wikimedia.org/r/368333 (https://phabricator.wikimedia.org/T164030) (owner: Dzahn)
[20:38:03] (PS2) Chad: WIP: moving update wikiversions to scap [mediawiki-config] - https://gerrit.wikimedia.org/r/367828
[20:45:18] RECOVERY - Check the NTP synchronisation status of timesyncd on releases2001 is OK: OK: synced at Fri 2017-07-28 20:45:13 UTC.
[20:51:53] Krinkle: i honestly know nothing about that stuff but i can +2 if you want me to
[20:52:00] (PS3) Chad: WIP: moving update wikiversions to scap [mediawiki-config] - https://gerrit.wikimedia.org/r/367828
[20:54:36] (CR) Urbanecm: Initial configuration for hiwikiversity (3 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/368165 (https://phabricator.wikimedia.org/T168765) (owner: Urbanecm)
[20:59:11] (PS4) Chad: WIP: moving update wikiversions to scap [mediawiki-config] - https://gerrit.wikimedia.org/r/367828
[21:03:37] (PS5) Chad: WIP: moving update wikiversions to scap [mediawiki-config] - https://gerrit.wikimedia.org/r/367828
[21:06:47] MatmaRex: Aye, that'd be nice
[21:08:56] Thanks
[21:10:30] (CR) Dzahn: [C: 2] "http://puppet-compiler.wmflabs.org/7209/releases1001.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/368333 (https://phabricator.wikimedia.org/T164030) (owner: Dzahn)
[21:14:40] Operations: librenms - syslog stopped working after migration - https://phabricator.wikimedia.org/T172008#3482759 (Dzahn)
[21:14:48] Operations: librenms - syslog stopped working after migration - https://phabricator.wikimedia.org/T172008#3482774 (Dzahn) p:Triage>High
[21:40:54] (PS6) Chad: WIP: moving update wikiversions to scap [mediawiki-config] - https://gerrit.wikimedia.org/r/367828
[21:40:56] I'm checking logstash for some errors and found lots of errors like this: https://logstash.wikimedia.org/app/kibana#/doc/logstash-*/logstash-2017.07.28/apache2?id=AV2KcTDcCtkHCY6a_HUI&_g=()
[21:41:05] " AH01067: Failed to read FastCGI header"
[21:41:19] Is it normal? Just wanted to give the heads up
[21:42:27] Operations, RESTBase, RESTBase-API, Traffic, Services (next): RESTBase support for www.wikimedia.org missing - https://phabricator.wikimedia.org/T133178#3482880 (GWicke) >>! In T133178#3428811, @Krinkle wrote: > I'd recommend the latter, but not indefinitely.
We'd deprecate REST on `wikimedia...
[21:48:20] Another fantastic error: https://su.wikipedia.org/w/index.php?title=Propinsi_Gifu&action=info
[21:54:02] Amir1_, fatal error: Argument 1 passed to MediaWiki\Linker\LinkRenderer::makeLink() must implement interface MediaWiki\Linker\LinkTarget, null given in /srv/mediawiki/php-1.30.0-wmf.11/includes/actions/InfoAction.php on line 240
[21:54:14] MaxSem: https://phabricator.wikimedia.org/T172016
[21:54:19] just made the bug
[21:54:39] The page is not redirect but ActionInfo thinks so and tries to load redirect target
[21:55:58] it checks for $title->isRedirect()
[21:56:44] so we have a discrepancy somewhere
[22:01:41] (PS7) Chad: WIP: moving update wikiversions to scap [mediawiki-config] - https://gerrit.wikimedia.org/r/367828
[22:02:39] (PS8) Chad: WIP: moving update wikiversions to scap [mediawiki-config] - https://gerrit.wikimedia.org/r/367828
[22:04:29] (CR) Krinkle: [C: -1] Fix exceptionmonitor [puppet] - https://gerrit.wikimedia.org/r/249905 (owner: MaxSem)
[22:14:07] (PS9) Chad: WIP: moving update wikiversions to scap [mediawiki-config] - https://gerrit.wikimedia.org/r/367828
[22:18:42] (PS10) Chad: WIP: moving update wikiversions to scap [mediawiki-config] - https://gerrit.wikimedia.org/r/367828
[22:19:33] (PS11) Chad: WIP: moving update wikiversions to scap [mediawiki-config] - https://gerrit.wikimedia.org/r/367828
[22:22:18] PROBLEM - puppet last run on mw1295 is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues
[22:23:39] (Abandoned) MaxSem: Fix exceptionmonitor [puppet] - https://gerrit.wikimedia.org/r/249905 (owner: MaxSem)
[22:26:50] (PS12) Chad: Moving update wikiversions to scap [mediawiki-config] - https://gerrit.wikimedia.org/r/367828
[22:30:32] (PS13) Chad: Moving update wikiversions to scap [mediawiki-config] - https://gerrit.wikimedia.org/r/367828
[22:32:42] (PS1) MaxSem: logging: Remove exceptionmonitor [puppet] - https://gerrit.wikimedia.org/r/368522
[22:42:56] (PS1) Rush: openstack: move rabbitmq to module/profile/role [puppet] - https://gerrit.wikimedia.org/r/368523 (https://phabricator.wikimedia.org/T171494)
[22:44:43] (PS2) Rush: openstack: move rabbitmq to module/profile/role [puppet] - https://gerrit.wikimedia.org/r/368523 (https://phabricator.wikimedia.org/T171494)
[22:47:01] (PS3) Rush: wip openstack: move rabbitmq to module/profile/role [puppet] - https://gerrit.wikimedia.org/r/368523 (https://phabricator.wikimedia.org/T171494)
[22:53:28] RECOVERY - puppet last run on mw1295 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures
[22:56:54] (PS1) Dzahn: admins::dzahn: export reprepro base dir based on hostname [puppet] - https://gerrit.wikimedia.org/r/368524
[22:58:10] (PS2) Dzahn: admins::dzahn: export reprepro base dir based on hostname [puppet] - https://gerrit.wikimedia.org/r/368524
[22:59:07] (CR) Dzahn: [C: 2] admins::dzahn: export reprepro base dir based on hostname [puppet] - https://gerrit.wikimedia.org/r/368524 (owner: Dzahn)
[23:01:32] Operations, Release-Engineering-Team, vm-requests, Security-General: New ganeti VM for MW release pipeline work - https://phabricator.wikimedia.org/T163743#3483047 (Dzahn)
[23:01:35] Operations, Patch-For-Review, Release-Engineering-Team (Kanban), Security-General: setup releases1001.eqiad.wmnet (was: setup mwreleases1001) -
https://phabricator.wikimedia.org/T164030#3483046 (Dzahn) Open>Resolved
[23:03:00] Operations, Release-Engineering-Team (Watching / External): Provide cross-dc redundancy (active-active or active-passive) to all important misc services - https://phabricator.wikimedia.org/T156937#3483092 (Dzahn)
[23:03:03] Operations, Patch-For-Review, Release-Engineering-Team (Kanban), Security-General: setup releases1001.eqiad.wmnet (was: setup mwreleases1001) - https://phabricator.wikimedia.org/T164030#3218909 (Dzahn)
[23:03:06] Operations, vm-requests, Patch-For-Review, Release-Engineering-Team (Kanban): setup releases2001.codfw.wmnet - https://phabricator.wikimedia.org/T171917#3483090 (Dzahn) Open>Resolved
[23:06:01] (PS1) Dzahn: cache::misc: release: add codfw backend, make active-active [puppet] - https://gerrit.wikimedia.org/r/368527 (https://phabricator.wikimedia.org/T171917)
[23:07:00] (PS2) Dzahn: cache::misc: releases: add codfw backend, make active-active [puppet] - https://gerrit.wikimedia.org/r/368527 (https://phabricator.wikimedia.org/T171917)
[23:10:38] (PS3) Dzahn: cache::misc: releases: add codfw backend, make active-active [puppet] - https://gerrit.wikimedia.org/r/368527 (https://phabricator.wikimedia.org/T171917)
[23:10:49] Puppet, Cloud-VPS: role::puppetmaster::standalone on stretch: Unable to locate package geoipupdate - https://phabricator.wikimedia.org/T171916#3483101 (bd808) Discussed a bit on irc with @faidon. The recommended short term fix is to use jessie instead of stretch. The next tier of fix is for us to fix op...
[23:11:16] Puppet, Cloud-VPS: role::puppetmaster::standalone on stretch: Unable to locate package geoipupdate - https://phabricator.wikimedia.org/T171916#3483104 (bd808) p:Triage>Normal
[23:17:49] (PS4) Rush: openstack: move rabbitmq to module/profile/role [puppet] - https://gerrit.wikimedia.org/r/368523 (https://phabricator.wikimedia.org/T171494)
[23:17:54] (Abandoned) Rush: labtest: labcontrol2001 use rabbitmq role [puppet] - https://gerrit.wikimedia.org/r/366166 (https://phabricator.wikimedia.org/T167559) (owner: Rush)
[23:18:39] (PS5) Rush: openstack: move rabbitmq to module/profile/role [puppet] - https://gerrit.wikimedia.org/r/368523 (https://phabricator.wikimedia.org/T171494)
[23:25:46] (CR) Dzahn: [C: 2] cache::misc: releases: add codfw backend, make active-active [puppet] - https://gerrit.wikimedia.org/r/368527 (https://phabricator.wikimedia.org/T171917) (owner: Dzahn)
[23:32:41] !log puppetmaster2001 - git pulled in /var/lib/git/operations/puppet to sync with puppetmaster1001 - accidentally interrupted puppet-merge
[23:32:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:44:45] Operations, Patch-For-Review, Release-Engineering-Team (Watching / External): Provide cross-dc redundancy (active-active or active-passive) to all important misc services - https://phabricator.wikimedia.org/T156937#3483199 (Dzahn)
[23:45:17] Operations, Patch-For-Review, Release-Engineering-Team (Watching / External): Provide cross-dc redundancy (active-active or active-passive) to all important misc services - https://phabricator.wikimedia.org/T156937#2990470 (Dzahn)