[00:01:51] Operations, Beta-Cluster-Infrastructure, Patch-For-Review: Mails through deployment-mx SPF & DKIM fails - https://phabricator.wikimedia.org/T87338#4274226 (Krenair) Gmail is now showing, with that cherry-picked: SPF: PASS with IP 208.80.155.138 Learn more DKIM: 'PASS' with domain beta.wmflabs.org Lea...
[00:18:03] Operations, ops-codfw, Traffic: rack/setup/install LVS200[7-10] - https://phabricator.wikimedia.org/T196560#4274234 (Papaul)
[00:35:12] !log remove non-deployers from wmf-deployment Gerrit group (T196959)
[00:35:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:40:11] (PS2) Paladox: Gerrit: Add support for adding additional domains to alias in apache [puppet] - https://gerrit.wikimedia.org/r/439783
[00:41:00] (CR) jerkins-bot: [V: -1] Gerrit: Add support for adding additional domains to alias in apache [puppet] - https://gerrit.wikimedia.org/r/439783 (owner: Paladox)
[00:42:17] (PS1) Papaul: DNS: Add mgmt & production DNS entries for lvs200[7-10] [dns] - https://gerrit.wikimedia.org/r/439803 (https://phabricator.wikimedia.org/T196560)
[00:45:12] Operations, ops-codfw, Traffic, Patch-For-Review: rack/setup/install LVS200[7-10] - https://phabricator.wikimedia.org/T196560#4274283 (Papaul)
[00:45:50] (PS3) Paladox: Gerrit: Add support for adding additional domains to alias in apache [puppet] - https://gerrit.wikimedia.org/r/439783
[00:46:10] (PS4) Paladox: Gerrit: Add support for adding additional domains to alias in apache [puppet] - https://gerrit.wikimedia.org/r/439783
[00:46:49] (CR) jerkins-bot: [V: -1] Gerrit: Add support for adding additional domains to alias in apache [puppet] - https://gerrit.wikimedia.org/r/439783 (owner: Paladox)
[00:47:17] (PS5) Paladox: Gerrit: Add support for adding additional domains to alias in apache [puppet] - https://gerrit.wikimedia.org/r/439783
[00:49:16] (CR) Paladox: "This will be used to add http://gerrit.wmfusercontent.org in a separate commit which will then be used to supply avatars in gerrit." [puppet] - https://gerrit.wikimedia.org/r/439783 (owner: Paladox)
[00:49:48] (PS1) Paladox: Gerrit: Hook up gerrit.wmfusercontent.org to alias [puppet] - https://gerrit.wikimedia.org/r/439808
[00:50:16] (PS2) Paladox: Gerrit: Hook up gerrit.wmfusercontent.org to alias [puppet] - https://gerrit.wikimedia.org/r/439808
[00:50:31] (CR) Paladox: "This change is ready for review." [puppet] - https://gerrit.wikimedia.org/r/439808 (owner: Paladox)
[00:54:44] PROBLEM - proton endpoints health on proton1002 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Respond file not found for a nonexistent title) timed out before a response was received
[00:56:55] RECOVERY - proton endpoints health on proton1002 is OK: All endpoints are healthy
[01:54:02] (PS1) Papaul: DHCP: Add MAC address and netboot entries for backup2001 [puppet] - https://gerrit.wikimedia.org/r/439830 (https://phabricator.wikimedia.org/T196477)
[02:02:52] (PS2) Papaul: DNS: Add mgmt DNS entries for bast2002 (supposed to be in public VLAN) [dns] - https://gerrit.wikimedia.org/r/439786 (https://phabricator.wikimedia.org/T196665)
[02:35:35] !log l10nupdate@deploy1001 scap sync-l10n completed (1.32.0-wmf.7) (duration: 14m 10s)
[02:35:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:45:53] !log l10nupdate@deploy1001 ResourceLoader cache refresh completed at Tue Jun 12 02:45:53 UTC 2018 (duration 10m 18s)
[02:45:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:59:13] (PS8) KartikMistry: apertium-apy: New upstream release [debs/contenttranslation/apertium-apy] - https://gerrit.wikimedia.org/r/433318 (https://phabricator.wikimedia.org/T194342)
[04:00:15] (PS2) KartikMistry: Update apertium-apy initscripts [puppet] - https://gerrit.wikimedia.org/r/438135 (https://phabricator.wikimedia.org/T194342)
[05:01:05] PROBLEM - Check systemd state on restbase-dev1006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[05:04:05] (PS1) Marostegui: db-eqiad.php: Depool db1091 [mediawiki-config] - https://gerrit.wikimedia.org/r/439835 (https://phabricator.wikimedia.org/T191316)
[05:04:25] RECOVERY - Check systemd state on restbase-dev1006 is OK: OK - running: The system is fully operational
[05:06:41] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1091 [mediawiki-config] - https://gerrit.wikimedia.org/r/439835 (https://phabricator.wikimedia.org/T191316) (owner: Marostegui)
[05:08:12] (Merged) jenkins-bot: db-eqiad.php: Depool db1091 [mediawiki-config] - https://gerrit.wikimedia.org/r/439835 (https://phabricator.wikimedia.org/T191316) (owner: Marostegui)
[05:08:29] (CR) jenkins-bot: db-eqiad.php: Depool db1091 [mediawiki-config] - https://gerrit.wikimedia.org/r/439835 (https://phabricator.wikimedia.org/T191316) (owner: Marostegui)
[05:08:56] Operations, DBA, Gerrit, Phabricator: Massive increase of writes in m3 section - https://phabricator.wikimedia.org/T196840#4274454 (Marostegui) Not sure if the above actions by @mmodell should have shown any changes on the write patterns, but so far, they remain the same https://grafana.wikimedi...
[05:09:30] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1091 for alter table (duration: 00m 52s)
[05:09:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:09:41] !log Deploy schema change on db1091 T191316 T192926 T89737 T195193
[05:09:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:09:48] T89737: Make several mediawiki table fields unsigned ints on wmf databases - https://phabricator.wikimedia.org/T89737
[05:09:48] T192926: Schema change to drop archive.ar_text and archive.ar_flags - https://phabricator.wikimedia.org/T192926
[05:09:48] T195193: Schema change for ct_tag_id field to change_tag - https://phabricator.wikimedia.org/T195193
[05:09:48] T191316: Schema change to make archive.ar_rev_id NOT NULL - https://phabricator.wikimedia.org/T191316
[05:19:25] RECOVERY - Check systemd state on kubernetes2003 is OK: OK - running: The system is fully operational
[05:22:45] PROBLEM - Check systemd state on kubernetes2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[05:46:53] Operations, ops-eqiad, DC-Ops: Replace disk on mw1230 - https://phabricator.wikimedia.org/T196881#4274466 (ops-monitoring-bot) Script wmf-auto-reimage was launched by oblivian on neodymium.eqiad.wmnet for hosts: ``` mw1230.eqiad.wmnet ``` The log can be found in `/var/log/wmf-auto-reimage/20180612054...
[05:46:56] Operations, ops-eqiad, DC-Ops: Replace disk on mw1230 - https://phabricator.wikimedia.org/T196881#4274467 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['mw1230.eqiad.wmnet'] ``` Of which those **FAILED**: ``` ['mw1230.eqiad.wmnet'] ```
[05:47:58] Operations, ops-eqiad, DC-Ops: Replace disk on mw1230 - https://phabricator.wikimedia.org/T196881#4274468 (ops-monitoring-bot) Script wmf-auto-reimage was launched by oblivian on neodymium.eqiad.wmnet for hosts: ``` mw1230.eqiad.wmnet ``` The log can be found in `/var/log/wmf-auto-reimage/20180612054...
[05:48:00] Operations, ops-eqiad, DC-Ops: Replace disk on mw1230 - https://phabricator.wikimedia.org/T196881#4274469 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['mw1230.eqiad.wmnet'] ``` Of which those **FAILED**: ``` ['mw1230.eqiad.wmnet'] ```
[05:48:09] <_joe_> this thing really doesn't work
[05:48:44] Operations, ops-eqiad, DC-Ops: Replace disk on mw1230 - https://phabricator.wikimedia.org/T196881#4274470 (ops-monitoring-bot) Script wmf-auto-reimage was launched by oblivian on neodymium.eqiad.wmnet for hosts: ``` mw1230.eqiad.wmnet ``` The log can be found in `/var/log/wmf-auto-reimage/20180612054...
[05:48:46] Operations, ops-eqiad, DC-Ops: Replace disk on mw1230 - https://phabricator.wikimedia.org/T196881#4274471 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['mw1230.eqiad.wmnet'] ``` Of which those **FAILED**: ``` ['mw1230.eqiad.wmnet'] ```
[05:49:39] Operations, ops-eqiad, DC-Ops: Replace disk on mw1230 - https://phabricator.wikimedia.org/T196881#4274472 (ops-monitoring-bot) Script wmf-auto-reimage was launched by oblivian on neodymium.eqiad.wmnet for hosts: ``` mw1230.eqiad.wmnet ``` The log can be found in `/var/log/wmf-auto-reimage/20180612054...
[05:50:18] Operations, ops-eqiad, DC-Ops: Replace disk on mw1230 - https://phabricator.wikimedia.org/T196881#4274473 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['mw1230.eqiad.wmnet'] ``` Of which those **FAILED**: ``` ['mw1230.eqiad.wmnet'] ```
[05:50:42] <_joe_> ok this is really really frustrating, I'll reimage that host by hand
[05:51:12] <_joe_> (╯°□°)╯︵ ┻━┻
[05:54:33] what errors do you get??
[05:58:48] PROBLEM - mcrouter process on mw2235 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 114 (mcrouter), command name mcrouter
[06:03:03] <_joe_> this is me ^^
[06:03:10] <_joe_> I'm doing some further tests
[06:03:30] <_joe_> elukey: whatever, I don
[06:03:40] <_joe_> 't have time for broken processes and broken docs
[06:10:08] RECOVERY - Apache HTTP on mw1230 is OK: HTTP OK: HTTP/1.1 200 OK - 10975 bytes in 0.001 second response time
[06:16:18] RECOVERY - mcrouter process on mw2235 is OK: PROCS OK: 1 process with UID = 114 (mcrouter), command name mcrouter
[06:25:18] RECOVERY - configured eth on mw1230 is OK: OK - interfaces up
[06:25:19] RECOVERY - dhclient process on mw1230 is OK: PROCS OK: 0 processes with command name dhclient
[06:25:28] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1091" [mediawiki-config] - https://gerrit.wikimedia.org/r/439837
[06:25:28] RECOVERY - MD RAID on mw1230 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[06:25:39] RECOVERY - High CPU load on API appserver on mw1230 is OK: OK - load average: 6.56, 4.28, 2.28
[06:25:39] RECOVERY - Check whether ferm is active by checking the default input chain on mw1230 is OK: OK ferm input default policy is set
[06:25:58] RECOVERY - Disk space on mw1230 is OK: DISK OK
[06:25:59] RECOVERY - HHVM processes on mw1230 is OK: PROCS OK: 6 processes with command name hhvm
[06:26:09] RECOVERY - mcrouter process on mw1230 is OK: PROCS OK: 1 process with UID = 114 (mcrouter), command name mcrouter
[06:26:09] RECOVERY - DPKG on mw1230 is OK: All packages OK
[06:26:18] RECOVERY - Check size of conntrack table on mw1230 is OK: OK: nf_conntrack is 0 % full
[06:27:19] RECOVERY - HHVM rendering on mw1230 is OK: HTTP OK: HTTP/1.1 200 OK - 76410 bytes in 8.063 second response time
[06:27:28] RECOVERY - Nginx local proxy to apache on mw1230 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 619 bytes in 2.424 second response time
[06:27:29] (CR) Marostegui: [C: 2] Revert "db-eqiad.php: Depool db1091" [mediawiki-config] - https://gerrit.wikimedia.org/r/439837 (owner: Marostegui)
[06:29:01] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1091" [mediawiki-config] - https://gerrit.wikimedia.org/r/439837 (owner: Marostegui)
[06:29:14] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1091" [mediawiki-config] - https://gerrit.wikimedia.org/r/439837 (owner: Marostegui)
[06:30:07] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1091 after alter table (duration: 00m 51s)
[06:30:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:30:20] Operations, ops-eqiad, Datasets-General-or-Unknown, Patch-For-Review, User-ArielGlenn: rack/setup/install snapshot1009 - https://phabricator.wikimedia.org/T196189#4274480 (ArielGlenn)
[06:30:49] (PS1) Marostegui: db-eqiad.php: Depool db1121 [mediawiki-config] - https://gerrit.wikimedia.org/r/439838 (https://phabricator.wikimedia.org/T191316)
[06:31:29] PROBLEM - puppet last run on labvirt1017 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/nova/policy.json]
[06:31:40] !log Stop replication on db1095, db1102, db1125 to change triggers - T192926
[06:31:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:31:45] T192926: Schema change to drop archive.ar_text and archive.ar_flags - https://phabricator.wikimedia.org/T192926
[06:34:56] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1121 [mediawiki-config] - https://gerrit.wikimedia.org/r/439838 (https://phabricator.wikimedia.org/T191316) (owner: Marostegui)
[06:36:27] (Merged) jenkins-bot: db-eqiad.php: Depool db1121 [mediawiki-config] - https://gerrit.wikimedia.org/r/439838 (https://phabricator.wikimedia.org/T191316) (owner: Marostegui)
[06:37:36] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1121 for alter table (duration: 00m 50s)
[06:37:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:37:56] !log Deploy schema change on db1121 with replication, this will generate lag on labsdb:s4 T191316 T192926 T89737 T195193
[06:38:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:38:02] T89737: Make several mediawiki table fields unsigned ints on wmf databases - https://phabricator.wikimedia.org/T89737
[06:38:03] T192926: Schema change to drop archive.ar_text and archive.ar_flags - https://phabricator.wikimedia.org/T192926
[06:38:03] T195193: Schema change for ct_tag_id field to change_tag - https://phabricator.wikimedia.org/T195193
[06:38:03] T191316: Schema change to make archive.ar_rev_id NOT NULL - https://phabricator.wikimedia.org/T191316
[06:38:32] (PS2) Dzahn: Remove /xhprof from performance.wikimedia.org apache config [puppet] - https://gerrit.wikimedia.org/r/439647 (https://phabricator.wikimedia.org/T196406) (owner: Imarlier)
[06:38:54] (CR) jenkins-bot: db-eqiad.php: Depool db1121 [mediawiki-config] - https://gerrit.wikimedia.org/r/439838 (https://phabricator.wikimedia.org/T191316) (owner: Marostegui)
[06:38:59] RECOVERY - Check the NTP synchronisation status of timesyncd on mw1230 is OK: OK: synced at Tue 2018-06-12 06:38:53 UTC.
[06:49:09] (CR) Dzahn: [C: 2] Remove /xhprof from performance.wikimedia.org apache config [puppet] - https://gerrit.wikimedia.org/r/439647 (https://phabricator.wikimedia.org/T196406) (owner: Imarlier)
[06:49:20] RECOVERY - Check systemd state on kubernetes2003 is OK: OK - running: The system is fully operational
[06:52:39] PROBLEM - Check systemd state on kubernetes2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[06:53:30] RECOVERY - Long running screen/tmux on mw1230 is OK: OK: No SCREEN or tmux processes detected.
[06:54:10] RECOVERY - IPMI Sensor Status on mw1230 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK
[06:55:39] RECOVERY - puppet last run on mw1230 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:56:49] RECOVERY - puppet last run on labvirt1017 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:07:07] (CR) Muehlenhoff: [C: 1] "Two nits, looks good to me!" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/439641 (https://phabricator.wikimedia.org/T191300) (owner: Volans)
[07:11:25] PROBLEM - proton endpoints health on proton1002 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received
[07:11:34] Operations, Move-Files-To-Commons, TCB-Team, Wikimedia-Extension-setup, and 2 others: Deploying FileExporter and FileImporter - https://phabricator.wikimedia.org/T190716#4274554 (Lea_WMDE)
[07:11:42] (CR) Jcrespo: [C: 1] "This looks good to me, let me know when to merge it and how to test it to validate it." [puppet] - https://gerrit.wikimedia.org/r/439581 (https://phabricator.wikimedia.org/T196604) (owner: Aklapper)
[07:12:25] RECOVERY - proton endpoints health on proton1002 is OK: All endpoints are healthy
[07:12:34] (CR) Dzahn: [C: 2] phabricator weekly project changes email: Add mysql slave port parameter [puppet] - https://gerrit.wikimedia.org/r/439581 (https://phabricator.wikimedia.org/T196604) (owner: Aklapper)
[07:12:39] (PS2) Dzahn: phabricator weekly project changes email: Add mysql slave port parameter [puppet] - https://gerrit.wikimedia.org/r/439581 (https://phabricator.wikimedia.org/T196604) (owner: Aklapper)
[07:12:57] Operations, Move-Files-To-Commons, TCB-Team, Wikimedia-Extension-setup, and 2 others: Deploying FileExporter and FileImporter - https://phabricator.wikimedia.org/T190716#4081970 (Lea_WMDE)
[07:13:08] (CR) Muehlenhoff: [C: 1] "One nit, looks good to me" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/439580 (https://phabricator.wikimedia.org/T191300) (owner: Volans)
[07:15:07] (CR) Muehlenhoff: [C: 1] debmonitor: client side setup (1 comment) [puppet] - https://gerrit.wikimedia.org/r/439580 (https://phabricator.wikimedia.org/T191300) (owner: Volans)
[07:17:51] (CR) Dzahn: [C: 2] "tested by running /usr/local/bin/project_changes.sh on phab1001.eqiad.wmnet" [puppet] - https://gerrit.wikimedia.org/r/439581 (https://phabricator.wikimedia.org/T196604) (owner: Aklapper)
[07:17:55] PROBLEM - Host cp3037 is DOWN: PING CRITICAL - Packet loss = 100%
[07:24:15] PROBLEM - IPsec on cp2002 is CRITICAL: Strongswan CRITICAL - ok: 78 not-conn: cp3037_v4, cp3037_v6
[07:24:16] (CR) Giuseppe Lavagetto: [C: -2] "I don't think this is the approach we should take if we want to make all those files templates. I even tried going this way in the past an" [puppet] - https://gerrit.wikimedia.org/r/322602 (https://phabricator.wikimedia.org/T1256) (owner: Alex Monk)
[07:24:25] PROBLEM - IPsec on cp1072 is CRITICAL: Strongswan CRITICAL - ok: 65 not-conn: cp3037_v6
[07:24:26] PROBLEM - IPsec on cp2005 is CRITICAL: Strongswan CRITICAL - ok: 78 not-conn: cp3037_v4, cp3037_v6
[07:24:35] PROBLEM - IPsec on cp1063 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp3037_v4, cp3037_v6
[07:24:35] PROBLEM - IPsec on cp2022 is CRITICAL: Strongswan CRITICAL - ok: 78 not-conn: cp3037_v4, cp3037_v6
[07:24:36] PROBLEM - IPsec on cp2017 is CRITICAL: Strongswan CRITICAL - ok: 78 not-conn: cp3037_v4, cp3037_v6
[07:24:36] PROBLEM - IPsec on cp2024 is CRITICAL: Strongswan CRITICAL - ok: 78 not-conn: cp3037_v4, cp3037_v6
[07:24:36] PROBLEM - IPsec on kafka-jumbo1006 is CRITICAL: Strongswan CRITICAL - ok: 134 connecting: cp3037_v4, cp3037_v6
[07:24:46] PROBLEM - IPsec on cp1073 is CRITICAL: Strongswan CRITICAL - ok: 65 not-conn: cp3037_v6
[07:24:46] PROBLEM - IPsec on cp1099 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp3037_v4, cp3037_v6
[07:24:55] PROBLEM - IPsec on kafka-jumbo1004 is CRITICAL: Strongswan CRITICAL - ok: 134 connecting: cp3037_v4, cp3037_v6
[07:24:55] PROBLEM - IPsec on kafka-jumbo1002 is CRITICAL: Strongswan CRITICAL - ok: 134 connecting: cp3037_v4, cp3037_v6
[07:25:05] PROBLEM - IPsec on cp2020 is CRITICAL: Strongswan CRITICAL - ok: 78 not-conn: cp3037_v4, cp3037_v6
[07:25:05] PROBLEM - IPsec on kafka-jumbo1003 is CRITICAL: Strongswan CRITICAL - ok: 134 connecting: cp3037_v4, cp3037_v6
[07:25:05] PROBLEM - IPsec on cp2011 is CRITICAL: Strongswan CRITICAL - ok: 78 not-conn: cp3037_v4, cp3037_v6
[07:25:05] PROBLEM - IPsec on cp1049 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp3037_v4, cp3037_v6
[07:25:05] PROBLEM - IPsec on cp1048 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp3037_v4, cp3037_v6
[07:25:15] PROBLEM - IPsec on cp2014 is CRITICAL: Strongswan CRITICAL - ok: 78 not-conn: cp3037_v4, cp3037_v6
[07:25:15] PROBLEM - IPsec on cp1064 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp3037_v4, cp3037_v6
[07:25:15] PROBLEM - IPsec on cp2026 is CRITICAL: Strongswan CRITICAL - ok: 78 not-conn: cp3037_v4, cp3037_v6
[07:25:15] PROBLEM - IPsec on cp1050 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp3037_v4, cp3037_v6
[07:25:25] PROBLEM - IPsec on kafka-jumbo1005 is CRITICAL: Strongswan CRITICAL - ok: 134 connecting: cp3037_v4, cp3037_v6
[07:25:25] PROBLEM - IPsec on cp1071 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp3037_v4, cp3037_v6
[07:25:25] PROBLEM - IPsec on kafka-jumbo1001 is CRITICAL: Strongswan CRITICAL - ok: 134 connecting: cp3037_v4, cp3037_v6
[07:25:26] PROBLEM - IPsec on cp1074 is CRITICAL: Strongswan CRITICAL - ok: 65 not-conn: cp3037_v6
[07:25:35] PROBLEM - IPsec on cp1062 is CRITICAL: Strongswan CRITICAL - ok: 65 not-conn: cp3037_v6
[07:25:45] PROBLEM - IPsec on cp2008 is CRITICAL: Strongswan CRITICAL - ok: 78 not-conn: cp3037_v4, cp3037_v6
[07:26:21] I'll take a look at cp3037
[07:26:36] I cannot reach it over the network nor over the management interface
[07:27:20] hah! might be dead in the water
[07:28:44] vgutierrez: I take it you'll keep on looking/followup? to avoid both working on it
[07:28:59] I'm trying to :)
[07:30:26] Operations, Wikimedia-Apache-configuration: Re-organize the apache configuration for MediaWiki in puppet - https://phabricator.wikimedia.org/T196968#4274566 (Joe)
[07:30:59] (CR) Giuseppe Lavagetto: [C: -2] "I created a task about my plans here https://phabricator.wikimedia.org/T196968" [puppet] - https://gerrit.wikimedia.org/r/322602 (https://phabricator.wikimedia.org/T1256) (owner: Alex Monk)
[07:31:16] !log closing idle screen session on tin (about to be decomed, dont use anymore)
[07:31:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:31:38] vgutierrez: interesting enough cp3037 doesn't have an icinga check for the mgmt interface...
[07:32:06] volans: also on librenms is showing traffic in real time...
[07:32:13] https://librenms.wikimedia.org/device/device=138/tab=port/port=10861/view=realtime/
[07:32:13] all the surrounding cp30* have it ofc
[07:32:32] that... or the port is mislabeled :/
[07:34:41] so, the check was removed in the last run of puppet on icinga
[07:34:59] I'll check that part
[07:36:01] (PS6) Volans: debmonitor: client side setup [puppet] - https://gerrit.wikimedia.org/r/439580 (https://phabricator.wikimedia.org/T191300)
[07:36:03] (PS6) Volans: debmonitor: install debmonitor-client [puppet] - https://gerrit.wikimedia.org/r/439641 (https://phabricator.wikimedia.org/T191300)
[07:36:17] (CR) Volans: "Thanks for the review, replies inline." (2 comments) [puppet] - https://gerrit.wikimedia.org/r/439580 (https://phabricator.wikimedia.org/T191300) (owner: Volans)
[07:36:31] (CR) Volans: "Thanks for the review, replies inline." (2 comments) [puppet] - https://gerrit.wikimedia.org/r/439641 (https://phabricator.wikimedia.org/T191300) (owner: Volans)
[07:37:17] (CR) Volans: "The full compiler (with PS5) is available at:" [puppet] - https://gerrit.wikimedia.org/r/439641 (https://phabricator.wikimedia.org/T191300) (owner: Volans)
[07:38:41] ipmitool "chassis status" is also failing for cp3037
[07:39:29] (CR) Muehlenhoff: [C: 1] debmonitor: client side setup (1 comment) [puppet] - https://gerrit.wikimedia.org/r/439580 (https://phabricator.wikimedia.org/T191300) (owner: Volans)
[07:39:38] right... xe-3/0/04 labeled as cp3037 is actually cp3036
[07:39:50] how weird that the check for mgmt is gone
[07:39:53] (CR) Muehlenhoff: [C: 1] debmonitor: client side setup [puppet] - https://gerrit.wikimedia.org/r/439580 (https://phabricator.wikimedia.org/T191300) (owner: Volans)
[07:39:56] *xe-3/0/4 sorry
[07:40:00] yeah, I'm trying to understand why
[07:40:49] i can confirm it and there seems nothing that explains it in site.pp.. same role as others and that gets added deep in base.pp
[07:41:03] it's not in the last compiled catalog of that host
[07:41:15] !log ganeti2003 reboot for microcode update
[07:41:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:41:24] so seems that the resource was not exported by puppet, but why?
[07:42:17] a serial console session on cp3036.mgmt actually connects to cp3036 at least
[07:43:03] it might be explained by:
[07:43:04] $facts['has_ipmi'] and $facts['ipmi_lan'] and 'ipaddress' in $facts['ipmi_lan']
[07:43:05] PROBLEM - Host ganeti2003 is DOWN: PING CRITICAL - Packet loss = 100%
[07:43:12] the mgmt checks are if guarded by the above
[07:43:35] oh, that seems like it can explain it.. when DRAC breaks?
[07:43:35] (CR) Muehlenhoff: [C: 1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/439641 (https://phabricator.wikimedia.org/T191300) (owner: Volans)
[07:43:49] heh, does that mean it's like "only if DRAC works, then check it"
[07:43:52] !log ganeti2007 reboot for microcode update
[07:43:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:44:23] yeah, ipmi_lan fact is missing
[07:44:31] i feel this will end in reseating DRAC and then it's back :p
[07:44:46] isn't it embedded ?
[07:44:54] it's not a dedicated card, is it ?
[07:44:55] RECOVERY - Host ganeti2003 is UP: PING OK - Packet loss = 0%, RTA = 36.12 ms
[07:45:01] then replacing the board
[07:45:05] lol
[07:45:15] well, drain flea power first I guess
[07:45:22] basically cp3036 and cp3037 labels are switched on asw-esams.. so xe-3/0/5 is cp3037 and the port is reported as physically up but no traffic there ofc
[07:46:05] PROBLEM - Host ganeti2007 is DOWN: PING CRITICAL - Packet loss = 100%
[07:46:13] <_joe_> uh?
[07:46:17] <_joe_> oh ok
[07:46:29] nothing to see here, move along
[07:46:47] <_joe_> yeah I read the DOWN and then read backscroll
[07:46:47] maybe that if-guard should have an else-branch that says "WARN - no IPMI IP"
[07:46:56] so yeah, for me 'bmc-config -o -S Lan_Conf' failed / didn't return valid data, and the fact was not populated, hence the resource was not exported
[07:47:33] mutante: yeah, but we need then to if guard everything with an is_virtual
[07:47:35] RECOVERY - Host ganeti2007 is UP: PING OK - Packet loss = 0%, RTA = 36.12 ms
[07:47:58] also, puppet defines a check, not its return value ;)
[07:48:01] it's a bit tricky
[07:48:06] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1121" [mediawiki-config] - https://gerrit.wikimedia.org/r/439842
[07:48:31] yep, node
[07:48:34] *nod*
[07:48:37] vgutierrez: fwiw at the end of april its remote ipmi was working (I've done an audit)
[07:50:01] (CR) Marostegui: [C: 2] Revert "db-eqiad.php: Depool db1121" [mediawiki-config] - https://gerrit.wikimedia.org/r/439842 (owner: Marostegui)
[07:51:12] from bast3002, at least the mgmt interface is reachable aka 3-way handshake but I'm not able to get a proper ssh session there
[07:51:45] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1121" [mediawiki-config] - https://gerrit.wikimedia.org/r/439842 (owner: Marostegui)
[07:52:29] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1121" [mediawiki-config] - https://gerrit.wikimedia.org/r/439842 (owner: Marostegui)
[07:53:25] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1121 after alter table (duration: 00m 50s)
[07:53:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:00:48] Operations, ops-eqiad: Degraded RAID on ms-be1034 - https://phabricator.wikimedia.org/T195569#4274603 (fgiunchedi) Yeah I think it might have been the controller barfing and the disk is actually ok. I couldn't find related logs on lithium tho so hard to know for sure. The disk can be sent back, we'll o...
[08:03:08] !log Deploy schema change on s1 codfw primary master (db2048) with replication, this will generate lag on codfw T191316 T192926 T89737 T195193
[08:03:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:03:14] T89737: Make several mediawiki table fields unsigned ints on wmf databases - https://phabricator.wikimedia.org/T89737
[08:03:15] T192926: Schema change to drop archive.ar_text and archive.ar_flags - https://phabricator.wikimedia.org/T192926
[08:03:15] T195193: Schema change for ct_tag_id field to change_tag - https://phabricator.wikimedia.org/T195193
[08:03:15] T191316: Schema change to make archive.ar_rev_id NOT NULL - https://phabricator.wikimedia.org/T191316
[08:04:29] !log ganeti2006 reboot for microcode update
[08:04:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:06:25] PROBLEM - Host ganeti2006 is DOWN: PING CRITICAL - Packet loss = 100%
[08:07:25] PROBLEM - Request latencies on acrab is CRITICAL: instance=10.192.16.26:6443 verb=PATCH https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[08:07:26] PROBLEM - etcd request latencies on acrab is CRITICAL: instance=10.192.16.26:6443 operation=compareAndSwap https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[08:08:15] RECOVERY - Host ganeti2006 is UP: PING OK - Packet loss = 0%, RTA = 36.12 ms
[08:08:52] Operations, Move-Files-To-Commons, TCB-Team, Wikimedia-Extension-setup, and 2 others: Deploying FileExporter and FileImporter - https://phabricator.wikimedia.org/T190716#4274635 (Lea_WMDE)
[08:09:41] Operations, Move-Files-To-Commons, TCB-Team, Wikimedia-Extension-setup, and 2 others: Deploying FileExporter and FileImporter - https://phabricator.wikimedia.org/T190716#4081970 (Lea_WMDE)
[08:09:45] RECOVERY - etcd request latencies on acrab is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[08:10:13] (CR) Volans: "Compiler is still happy: https://puppet-compiler.wmflabs.org/compiler02/11449/" [puppet] - https://gerrit.wikimedia.org/r/439580 (https://phabricator.wikimedia.org/T191300) (owner: Volans)
[08:10:40] (CR) Filippo Giunchedi: "LGTM, see nit inline" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/432528 (https://phabricator.wikimedia.org/T182085) (owner: 20after4)
[08:10:45] RECOVERY - Request latencies on acrab is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[08:15:19] !log ganeti2002 reboot for microcode update
[08:15:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:15:45] PROBLEM - Host ganeti2002 is DOWN: PING CRITICAL - Packet loss = 100%
[08:15:56] Operations, WMDE-QWERTY-Team, wikidiff2, Patch-For-Review: Update wikidiff2 library on the WMF production cluster - https://phabricator.wikimedia.org/T190717#4274651 (Lea_WMDE)
[08:18:05] RECOVERY - Host ganeti2002 is UP: PING OK - Packet loss = 0%, RTA = 36.46 ms
[08:18:47] PROBLEM - toolschecker: All k8s worker nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/k8s/nodes/ready - 185 bytes in 0.140 second response time
[08:21:46] Operations, ops-esams, netops: cp3036 and cp3037 production ports mislabeled - https://phabricator.wikimedia.org/T196970#4274656 (Vgutierrez)
[08:22:30] (PS2) Dvorapa: toollabs: install python{,3}-pymysql on exec_environ [puppet] - https://gerrit.wikimedia.org/r/416874 (https://phabricator.wikimedia.org/T189052) (owner: Zhuyifei1999)
[08:23:50] fwiw most tools workers seem fine at a kubectl get nodes -o wide
[08:24:04] checker is probably having a hiccup
[08:25:04] (CR) Alex Monk: "So if I understand correctly what you're saying is that having puppet generate files this size through templates slows it to a crawl and t" [puppet] - https://gerrit.wikimedia.org/r/322602 (https://phabricator.wikimedia.org/T1256) (owner: Alex Monk)
[08:25:32] (PS2) Marostegui: mariadb: Promote db1066 to master [puppet] - https://gerrit.wikimedia.org/r/439530 (https://phabricator.wikimedia.org/T194870)
[08:25:40] (PS2) Marostegui: db-eqiad.php: Set s2 as read only [mediawiki-config] - https://gerrit.wikimedia.org/r/439531 (https://phabricator.wikimedia.org/T194870)
[08:25:47] (PS2) Marostegui: db-eqiad.php: Promote db1066 to master and remove read only [mediawiki-config] - https://gerrit.wikimedia.org/r/439532 (https://phabricator.wikimedia.org/T194870)
[08:25:54] (PS2) Marostegui: wmnet: Update s2-master CNAME [dns] - https://gerrit.wikimedia.org/r/439533 (https://phabricator.wikimedia.org/T194870)
[08:26:03] (PS2) Alexandros Kosiaris: kubernetes: Remove deprecated --api-servers parameter [puppet] - https://gerrit.wikimedia.org/r/436483
[08:26:22] (CR) Volans: [C: 2] debmonitor: client side setup [puppet] - https://gerrit.wikimedia.org/r/439580 (https://phabricator.wikimedia.org/T191300) (owner: Volans)
[08:27:11] yeah
[08:27:35] (PS4) Giuseppe Lavagetto: systemd: add define specific to timers [puppet] - https://gerrit.wikimedia.org/r/417948
[08:28:11] not sure how checker generates that info
[08:29:29] (CR) Giuseppe Lavagetto: [C: -2] "> So if I understand correctly what you're saying is that having" [puppet] - https://gerrit.wikimedia.org/r/322602 (https://phabricator.wikimedia.org/T1256) (owner: Alex Monk)
[08:30:24] (PS1) Volans: debmonitor: fix directory creation [puppet] - https://gerrit.wikimedia.org/r/439849
[08:30:25] PROBLEM - puppet last run on conf1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/debmonitor/ssl]
[08:30:45] PROBLEM - puppet last run on db1067 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/debmonitor/ssl]
[08:30:59] it's me.... sorry
[08:31:03] (CR) Muehlenhoff: [C: 1] debmonitor: fix directory creation [puppet] - https://gerrit.wikimedia.org/r/439849 (owner: Volans)
[08:31:05] PROBLEM - puppet last run on contint2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/debmonitor/ssl]
[08:31:14] (CR) Volans: [C: 2] debmonitor: fix directory creation [puppet] - https://gerrit.wikimedia.org/r/439849 (owner: Volans)
[08:31:15] PROBLEM - puppet last run on ms-be1018 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/debmonitor/ssl]
[08:31:16] PROBLEM - puppet last run on elastic2035 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/debmonitor/ssl]
[08:31:25] PROBLEM - puppet last run on wtp2012 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/debmonitor/ssl]
[08:31:25] PROBLEM - puppet last run on wtp2005 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/debmonitor/ssl]
[08:31:35] PROBLEM - puppet last run on elastic1043 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/debmonitor/ssl]
[08:31:36] PROBLEM - puppet last run on mw2212 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/debmonitor/ssl]
[08:31:45] PROBLEM - puppet last run on rdb1006 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/debmonitor/ssl]
[08:31:55] PROBLEM - puppet last run on cp1047 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/debmonitor/ssl]
[08:31:55] PROBLEM - puppet last run on cp1062 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/debmonitor/ssl]
[08:31:55] fixing
[08:31:56] PROBLEM - puppet last run on ms-be1028 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/debmonitor/ssl]
[08:32:05] PROBLEM - puppet last run on elastic2030 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/debmonitor/ssl]
[08:32:05] PROBLEM - puppet last run on elastic1018 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/debmonitor/ssl]
[08:32:06] PROBLEM - puppet last run on kubestagetcd1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/debmonitor/ssl]
[08:32:06] PROBLEM - puppet last run on restbase1013 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/debmonitor/ssl]
[08:32:06] PROBLEM - puppet last run on wtp2007 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/debmonitor/ssl]
[08:32:10] I'm shutting up icinga-wm
[08:32:16] PROBLEM - puppet last run on mw1275 is CRITICAL: CRITICAL: Puppet has 1 failures.
Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/debmonitor/ssl] [08:32:17] thx [08:32:35] volans: should I take the opportunity to also reboot puppetdb for the spec-ctrl thing ? [08:32:42] go for it! [08:32:43] :D [08:33:36] !log reboot puppetdb1001 for spec-ctrl enable. Bundling it with a minor puppet outage to only have a torrent of harmless puppet failures once [08:33:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:33:51] akosiaris: let me know when done, so I do only one run of cumin run puppet on failed ones [08:35:03] arturo: btw some worker nodes are cordoned. I guess you are aware, just mentioning [08:35:06] !log ema@neodymium conftool action : set/pooled=no; selector: name=cp3037.esams.wmnet [08:35:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:35:24] volans: up and running [08:35:36] akosiaris: ack, thanks [08:35:52] akosiaris: actually I don't know why [08:35:58] !log rebalance ganeti codfw cluster [08:36:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:36:06] !log running puppet on failed hosts post small puppet outage and puppetdb reboot [08:36:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:36:12] and now the hopefully final round of VM reboots [08:39:33] (03PS1) 10Volans: debmonitor: fix newlines in conf file [puppet] - 10https://gerrit.wikimedia.org/r/439852 [08:41:52] (03CR) 10Volans: [C: 032] "Now it's correct:" [puppet] - 10https://gerrit.wikimedia.org/r/439852 (owner: 10Volans) [08:43:11] arturo: I am gonna merge https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/436483/. 
It should be noop as I 've already tested it [08:43:29] (03PS3) 10Alexandros Kosiaris: kubernetes: Remove deprecated --api-servers parameter [puppet] - 10https://gerrit.wikimedia.org/r/436483 [08:43:33] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] kubernetes: Remove deprecated --api-servers parameter [puppet] - 10https://gerrit.wikimedia.org/r/436483 (owner: 10Alexandros Kosiaris) [08:46:16] akosiaris: ok [08:48:02] !log Stop replication on db2094 to change triggers for archive table [08:48:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:52:48] arturo: I've noticed (from a puppetcompiler failure) that in the labs/privare repo the wmcs/monitoring/wmcs_monitoring_rsync key is missing. I was about to add it as snakeoil, but double checking with you in case a real one is needed there [08:53:29] *labs/private [08:54:33] volans: yeah, probably just the actual private exists, and not in labs/private [08:57:25] yes, it's in the real private one and not in the 'public' private [08:58:16] RECOVERY - puppet last run on aqs1006 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [08:58:25] RECOVERY - puppet last run on restbase-dev1006 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [08:58:25] RECOVERY - puppet last run on mw2253 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [08:58:26] RECOVERY - puppet last run on wtp1042 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [08:58:46] RECOVERY - puppet last run on db1053 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [08:59:05] RECOVERY - puppet last run on mc1021 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures [08:59:14] the puppet run has completed, all good [08:59:46] RECOVERY - puppet last run on ms-be1043 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [09:00:05] RECOVERY - 
puppet last run on mw1312 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:00:26] RECOVERY - Check systemd state on kubernetes2003 is OK: OK - running: The system is fully operational [09:00:36] akosiaris: FYI puppet fails on kubernetes2003.codfw.wmnet [09:00:48] Systemd start for docker failed! [09:01:14] volans: yeah ignore it [09:01:22] ack [09:01:22] I am still fighting with the imaging process [09:03:46] PROBLEM - Check systemd state on kubernetes2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [09:04:05] (03PS1) 10Volans: Add missing wmcs/monitoring dummy keys [labs/private] - 10https://gerrit.wikimedia.org/r/439856 [09:04:08] (03PS6) 10Alexandros Kosiaris: Add the nodes for the proton service [puppet] - 10https://gerrit.wikimedia.org/r/437995 (https://phabricator.wikimedia.org/T186748) [09:04:10] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Add the nodes for the proton service [puppet] - 10https://gerrit.wikimedia.org/r/437995 (https://phabricator.wikimedia.org/T186748) (owner: 10Alexandros Kosiaris) [09:04:20] arturo: ^^^ [09:04:58] volans: but that kubernetes has nothing to do with toolforge, right? [09:05:05] my patch [09:05:21] oh, I ingore `wikibugs` :-P [09:05:24] ignore* [09:05:30] ahhhh [09:05:31] :D [09:05:36] https://gerrit.wikimedia.org/r/439856 [09:06:07] (03CR) 10Arturo Borrero Gonzalez: [C: 032] Add missing wmcs/monitoring dummy keys [labs/private] - 10https://gerrit.wikimedia.org/r/439856 (owner: 10Volans) [09:06:14] volans: +2 [09:06:25] (03CR) 10Volans: [V: 032] Add missing wmcs/monitoring dummy keys [labs/private] - 10https://gerrit.wikimedia.org/r/439856 (owner: 10Volans) [09:06:31] ack done :) [09:07:16] thanks volans ! 
[09:07:26] yw [09:07:54] (03PS7) 10Volans: debmonitor: install debmonitor-client [puppet] - 10https://gerrit.wikimedia.org/r/439641 (https://phabricator.wikimedia.org/T191300) [09:08:15] (03PS10) 10Alexandros Kosiaris: mathoid: Use the kubernetes LVS cluster explictly [puppet] - 10https://gerrit.wikimedia.org/r/437254 [09:08:17] (03PS6) 10Alexandros Kosiaris: lvs: Add the proton lvs configuration [puppet] - 10https://gerrit.wikimedia.org/r/437997 (https://phabricator.wikimedia.org/T186748) [09:08:30] (03PS2) 10Gehel: elasticsearch: enable G1 garbage collector [puppet] - 10https://gerrit.wikimedia.org/r/437231 (https://phabricator.wikimedia.org/T156137) [09:10:51] (03CR) 10DCausse: [C: 031] elasticsearch: enable G1 garbage collector [puppet] - 10https://gerrit.wikimedia.org/r/437231 (https://phabricator.wikimedia.org/T156137) (owner: 10Gehel) [09:12:11] (03CR) 10Gehel: [C: 032] elasticsearch: enable G1 garbage collector [puppet] - 10https://gerrit.wikimedia.org/r/437231 (https://phabricator.wikimedia.org/T156137) (owner: 10Gehel) [09:12:52] (03PS11) 10Alexandros Kosiaris: mathoid: Use the kubernetes LVS cluster explictly [puppet] - 10https://gerrit.wikimedia.org/r/437254 [09:12:54] (03PS7) 10Alexandros Kosiaris: lvs: Add the proton lvs configuration [puppet] - 10https://gerrit.wikimedia.org/r/437997 (https://phabricator.wikimedia.org/T186748) [09:12:56] (03PS1) 10Alexandros Kosiaris: conftool: Add the mathoid service to kubernetes cluster [puppet] - 10https://gerrit.wikimedia.org/r/439857 [09:14:37] (03PS8) 10Muehlenhoff: Add initial Debianisation of debmonitor-client [software/debmonitor] - 10https://gerrit.wikimedia.org/r/438018 (https://phabricator.wikimedia.org/T191298) [09:15:35] (03CR) 10jerkins-bot: [V: 04-1] Add initial Debianisation of debmonitor-client [software/debmonitor] - 10https://gerrit.wikimedia.org/r/438018 (https://phabricator.wikimedia.org/T191298) (owner: 10Muehlenhoff) [09:16:42] the icinga-wm bot left the #wikimedia-cloud-feed channel, how can 
I tell it to rejoin? [09:17:42] (03CR) 10Volans: "Latest compiler results: https://puppet-compiler.wmflabs.org/compiler02/11452/" [puppet] - 10https://gerrit.wikimedia.org/r/439641 (https://phabricator.wikimedia.org/T191300) (owner: 10Volans) [09:18:19] arturo: mmmh checking, it was restarted by puppet, so should have re-joined all channels [09:18:52] volans: oh sorry it actually rejoined. irccloud wasn't clear about that :-P [09:19:02] ah ok, that makes sense [09:19:38] I didn't get the recovery message from the toolforge k8s thing [09:19:44] because the bot left [09:19:46] but is now ok [09:19:54] (03PS9) 10Muehlenhoff: Add initial Debianisation of debmonitor-client [software/debmonitor] - 10https://gerrit.wikimedia.org/r/438018 (https://phabricator.wikimedia.org/T191298) [09:21:00] (03CR) 10jerkins-bot: [V: 04-1] Add initial Debianisation of debmonitor-client [software/debmonitor] - 10https://gerrit.wikimedia.org/r/438018 (https://phabricator.wikimedia.org/T191298) (owner: 10Muehlenhoff) [09:29:45] RECOVERY - DPKG on multatuli is OK: All packages OK [09:30:32] (03PS1) 10Jcrespo: mariadb: Introduce replication password on mariadb root clients [puppet] - 10https://gerrit.wikimedia.org/r/439860 [09:30:46] PROBLEM - ircecho bot process on kraz is CRITICAL: PROCS CRITICAL: 0 processes with command name python, regex args /usr/local/bin/udpmxircecho.py [09:31:19] (03CR) 10jerkins-bot: [V: 04-1] mariadb: Introduce replication password on mariadb root clients [puppet] - 10https://gerrit.wikimedia.org/r/439860 (owner: 10Jcrespo) [09:31:48] (03PS2) 10Jcrespo: mariadb: Introduce replication password on mariadb root clients [puppet] - 10https://gerrit.wikimedia.org/r/439860 [09:31:55] RECOVERY - ircecho bot process on kraz is OK: PROCS OK: 1 process with command name python, regex args /usr/local/bin/udpmxircecho.py [09:32:32] (03CR) 10jerkins-bot: [V: 04-1] mariadb: Introduce replication password on mariadb root clients [puppet] - 10https://gerrit.wikimedia.org/r/439860 
(owner: 10Jcrespo) [09:32:46] (03PS3) 10Jcrespo: mariadb: Introduce replication password on mariadb root clients [puppet] - 10https://gerrit.wikimedia.org/r/439860 [09:33:16] 10Operations, 10ops-esams, 10Traffic: cp3037 is currently unreachable - https://phabricator.wikimedia.org/T196974#4274802 (10Vgutierrez) [09:33:27] (03CR) 10jerkins-bot: [V: 04-1] mariadb: Introduce replication password on mariadb root clients [puppet] - 10https://gerrit.wikimedia.org/r/439860 (owner: 10Jcrespo) [09:34:06] (03PS4) 10Jcrespo: mariadb: Introduce replication password on mariadb root clients [puppet] - 10https://gerrit.wikimedia.org/r/439860 [09:34:41] (03PS5) 10Jcrespo: mariadb: Introduce replication password on mariadb root clients [puppet] - 10https://gerrit.wikimedia.org/r/439860 [09:35:23] 10Operations, 10ops-esams, 10Traffic: cp3037 is currently unreachable - https://phabricator.wikimedia.org/T196974#4274802 (10Vgutierrez) p:05Triage>03Normal [09:36:25] ACKNOWLEDGEMENT - Host cp3037 is DOWN: PING CRITICAL - Packet loss = 100% Vgutierrez T196974 [09:37:29] (03CR) 10Jcrespo: [C: 032] mariadb: Introduce replication password on mariadb root clients [puppet] - 10https://gerrit.wikimedia.org/r/439860 (owner: 10Jcrespo) [09:41:55] !log cp3037 has been depooled due to unknown hardware issues T196974 [09:42:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:42:01] T196974: cp3037 is currently unreachable - https://phabricator.wikimedia.org/T196974 [09:44:36] RECOVERY - nutcracker process on mw1230 is OK: PROCS OK: 1 process with UID = 113 (nutcracker), command name nutcracker [09:44:45] RECOVERY - nutcracker port on mw1230 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212 [09:44:48] (03PS1) 10Volans: debmonitor: fine tune nginx fail_timeout [puppet] - 10https://gerrit.wikimedia.org/r/439865 (https://phabricator.wikimedia.org/T191299) [09:44:56] RECOVERY - Check systemd state on mw1230 is OK: OK - running: The system is fully 
operational [09:46:12] 10Operations, 10Move-Files-To-Commons, 10TCB-Team, 10Wikimedia-Extension-setup, and 4 others: Deploy FileExporter and FileImporter to group0 - https://phabricator.wikimedia.org/T195370#4274837 (10Lea_WMDE) [09:50:35] (03CR) 10Muehlenhoff: [C: 031] "Looks fine, we can do a real world test when debmonitor us run the first time on trusty." [puppet] - 10https://gerrit.wikimedia.org/r/439865 (https://phabricator.wikimedia.org/T191299) (owner: 10Volans) [09:50:55] (03CR) 10Volans: "Two nit/questions inline" (032 comments) [software/debmonitor] - 10https://gerrit.wikimedia.org/r/438018 (https://phabricator.wikimedia.org/T191298) (owner: 10Muehlenhoff) [09:51:19] (03CR) 10Volans: [C: 032] debmonitor: fine tune nginx fail_timeout [puppet] - 10https://gerrit.wikimedia.org/r/439865 (https://phabricator.wikimedia.org/T191299) (owner: 10Volans) [10:01:36] (03PS10) 10Muehlenhoff: Add initial Debianisation of debmonitor-client [software/debmonitor] - 10https://gerrit.wikimedia.org/r/438018 (https://phabricator.wikimedia.org/T191298) [10:02:35] PROBLEM - Host mwdebug2002 is DOWN: PING CRITICAL - Packet loss = 100% [10:02:56] (03CR) 10jerkins-bot: [V: 04-1] Add initial Debianisation of debmonitor-client [software/debmonitor] - 10https://gerrit.wikimedia.org/r/438018 (https://phabricator.wikimedia.org/T191298) (owner: 10Muehlenhoff) [10:03:15] RECOVERY - Host mwdebug2002 is UP: PING OK - Packet loss = 0%, RTA = 36.32 ms [10:05:20] ACKNOWLEDGEMENT - IPsec on kafka-jumbo1001 is CRITICAL: Strongswan CRITICAL - ok: 134 connecting: cp3037_v4, cp3037_v6 Ema cp3037 down T196974 [10:05:20] ACKNOWLEDGEMENT - IPsec on kafka-jumbo1002 is CRITICAL: Strongswan CRITICAL - ok: 134 connecting: cp3037_v4, cp3037_v6 Ema cp3037 down T196974 [10:05:20] ACKNOWLEDGEMENT - IPsec on kafka-jumbo1003 is CRITICAL: Strongswan CRITICAL - ok: 134 connecting: cp3037_v4, cp3037_v6 Ema cp3037 down T196974 [10:05:20] ACKNOWLEDGEMENT - IPsec on kafka-jumbo1004 is CRITICAL: Strongswan CRITICAL - 
ok: 134 connecting: cp3037_v4, cp3037_v6 Ema cp3037 down T196974 [10:05:20] ACKNOWLEDGEMENT - IPsec on kafka-jumbo1005 is CRITICAL: Strongswan CRITICAL - ok: 134 connecting: cp3037_v4, cp3037_v6 Ema cp3037 down T196974 [10:05:20] ACKNOWLEDGEMENT - IPsec on kafka-jumbo1006 is CRITICAL: Strongswan CRITICAL - ok: 134 connecting: cp3037_v4, cp3037_v6 Ema cp3037 down T196974 [10:06:32] 10Operations, 10LDAP: Update certificates on productions replicas of corp.wikimedia.org LDAP - https://phabricator.wikimedia.org/T168460#4274898 (10Aklapper) a:05bbogaert>03None [10:14:25] PROBLEM - DPKG on multatuli is CRITICAL: DPKG CRITICAL dpkg reports broken packages [10:16:55] PROBLEM - proton endpoints health on proton1001 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Respond file not found for a nonexistent title) timed out before a response was received [10:17:09] multatuli it mor.itz and me playing with debmonitor [10:18:49] 10Operations, 10Move-Files-To-Commons, 10TCB-Team, 10Wikimedia-Extension-setup, and 4 others: Deploy FileExporter and FileImporter to group0 - https://phabricator.wikimedia.org/T195370#4274935 (10Lea_WMDE) [10:19:06] RECOVERY - proton endpoints health on proton1001 is OK: All endpoints are healthy [10:19:32] 10Operations, 10Move-Files-To-Commons, 10TCB-Team, 10Wikimedia-Extension-setup, and 4 others: Deploy FileExporter and FileImporter to group0 - https://phabricator.wikimedia.org/T195370#4224988 (10Lea_WMDE) [10:20:55] (03PS1) 10Jcrespo: [WIP] Add replication managing [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/439871 [10:21:15] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Add replication managing [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/439871 (owner: 10Jcrespo) [10:21:25] PROBLEM - rsyslog TLS listener on port 6514 on lithium is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer [10:21:34] !log bounce stuck rsyslog on lithium / wezen - T136312 [10:21:36] that's me ^ [10:21:38] Logged 
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:21:39] T136312: Encrypt syslog traffic - https://phabricator.wikimedia.org/T136312 [10:21:45] RECOVERY - rsyslog TLS listener on port 6514 on lithium is OK: SSL OK - Certificate lithium.eqiad.wmnet valid until 2021-10-23 19:09:29 +0000 (expires in 1229 days) [10:21:51] (03PS3) 10Dvorapa: toollabs: install python{,3}-pymysql on exec_environ [puppet] - 10https://gerrit.wikimedia.org/r/416874 (https://phabricator.wikimedia.org/T189052) (owner: 10Zhuyifei1999) [10:24:55] 10Operations, 10JADE, 10Scoring-platform-team, 10User-Joe: Scalability concerns creating a page per revision - https://phabricator.wikimedia.org/T196547#4274967 (10awight) Wikidata wouldn't survive a year of this upper-bound unscalability. It has received 200M edits in the past 12 months, so we would have... [10:25:56] 10Operations, 10Move-Files-To-Commons, 10TCB-Team, 10Wikimedia-Extension-setup, and 4 others: Deploy FileExporter and FileImporter to group0 - https://phabricator.wikimedia.org/T195370#4224988 (10WMDE-Fisch) [10:26:25] !log setting expire_log_days on db1066 as 30 [10:26:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:28:55] PROBLEM - Check Varnish expiry mailbox lag on cp3046 is CRITICAL: CRITICAL: expiry mailbox lag is 2014133 [10:29:05] PROBLEM - puppet last run on cp5008 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:29:05] PROBLEM - puppet last run on mw2183 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:29:15] PROBLEM - puppet last run on mw2248 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:29:26] PROBLEM - puppet last run on labtestvirt2003 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [10:30:16] PROBLEM - puppet last run on db2075 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:30:45] PROBLEM - puppet last run on db2050 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:30:45] PROBLEM - puppet last run on dbstore2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:31:12] taking a look, puppetdb perhaps [10:31:58] indeed [10:32:06] PROBLEM - puppet last run on maps2002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:32:09] Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Failed to execute '/pdb/cmd/v1?checksum=5457825afc630dada2b6fbdbd3395d5b61c3ff12&version=5&certname=dbstore2001.codfw.wmnet&command=replace_facts&producer-timestamp=1528799187' on at least 1 of the following 'server_urls': https://puppetdb2001.codfw.wmnet [10:32:25] PROBLEM - puppet last run on mc2031 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:32:26] PROBLEM - puppet last run on mw2195 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:32:36] PROBLEM - puppet last run on cp4021 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:32:45] PROBLEM - puppet last run on elastic2017 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:33:05] PROBLEM - puppet last run on elastic2012 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:33:14] should be recovering [10:33:15] PROBLEM - puppet last run on mw2147 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [10:33:35] PROBLEM - puppet last run on wtp2002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:33:35] PROBLEM - puppet last run on ms-be2032 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:33:45] PROBLEM - puppet last run on mc2035 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:35:21] 10Operations, 10ops-eqiad, 10DBA, 10decommission: Decommission db1056 - https://phabricator.wikimedia.org/T193736#4275004 (10Marostegui) [10:35:23] probably the ganeti restarts [10:35:37] akosiaris: was puppetdb2001 also in the loop for restarts? [10:39:42] it was in the list of hosts needing a reboot at least [10:42:44] ack [10:43:53] 10Operations, 10Move-Files-To-Commons, 10TCB-Team, 10Wikimedia-Extension-setup, and 4 others: Deploy FileExporter and FileImporter to group0 - https://phabricator.wikimedia.org/T195370#4224988 (10thiemowmde) I'm afraid I did not fully understood what "linking to test wiki" means? Should https://test.wikipe... 
[10:44:25] RECOVERY - puppet last run on mw2183 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:45:54] (03PS11) 10Muehlenhoff: Add initial Debianisation of debmonitor-client [software/debmonitor] - 10https://gerrit.wikimedia.org/r/438018 (https://phabricator.wikimedia.org/T191298) [10:47:17] (03CR) 10jerkins-bot: [V: 04-1] Add initial Debianisation of debmonitor-client [software/debmonitor] - 10https://gerrit.wikimedia.org/r/438018 (https://phabricator.wikimedia.org/T191298) (owner: 10Muehlenhoff) [10:55:37] (03PS1) 10WMDE-Fisch: Allow setting of export target for FileExporter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/439875 (https://phabricator.wikimedia.org/T195370) [10:55:39] (03PS1) 10WMDE-Fisch: Enable FileExporter and FileImporter on group0 with test setup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/439876 (https://phabricator.wikimedia.org/T195370) [10:57:36] RECOVERY - puppet last run on maps2002 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [10:57:55] RECOVERY - puppet last run on mc2031 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [10:58:05] RECOVERY - puppet last run on mw2195 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [10:58:15] RECOVERY - puppet last run on cp4021 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:58:16] RECOVERY - puppet last run on elastic2017 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:58:35] RECOVERY - puppet last run on elastic2012 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [10:58:36] RECOVERY - puppet last run on mw2147 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:58:56] RECOVERY - puppet last run on wtp2002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:59:05] RECOVERY - puppet last run on 
ms-be2032 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:59:15] RECOVERY - puppet last run on mc2035 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [10:59:36] RECOVERY - puppet last run on cp5008 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [10:59:46] RECOVERY - puppet last run on mw2248 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [11:00:06] RECOVERY - puppet last run on labtestvirt2003 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [11:00:55] RECOVERY - puppet last run on db2075 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [11:00:58] 10Operations, 10Move-Files-To-Commons, 10TCB-Team, 10Wikimedia-Extension-setup, and 4 others: Deploy FileExporter and FileImporter to group0 - https://phabricator.wikimedia.org/T195370#4275065 (10WMDE-Fisch) >>! In T195370#4275029, @thiemowmde wrote: > I'm afraid I did not fully understood what "linking to... 
[11:01:16] RECOVERY - puppet last run on db2050 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [11:01:16] RECOVERY - puppet last run on dbstore2001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [11:02:35] PROBLEM - proton endpoints health on proton1002 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Respond file not found for a nonexistent title) is CRITICAL: Test Respond file not found for a nonexistent title returned the unexpected status 503 (expecting: 404) [11:02:57] (03PS2) 10WMDE-Fisch: Enable FileExporter and FileImporter on group0 with test setup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/439876 (https://phabricator.wikimedia.org/T195370) [11:03:45] RECOVERY - proton endpoints health on proton1002 is OK: All endpoints are healthy [11:06:45] 10Operations, 10JADE, 10Scoring-platform-team, 10User-Joe: Scalability concerns creating a page per revision - https://phabricator.wikimedia.org/T196547#4275076 (10awight) Some negatives to the per-page approach: * Slightly incompatible with ORES, which is per-revision. For example, fetching an ORES+JADE... [11:30:36] 10Operations, 10cloud-services-team, 10Patch-For-Review: cloud vps: disable system-wide apt pinning for OpenStack jessie hosts - https://phabricator.wikimedia.org/T196659#4275100 (10aborrero) I tried generating an apt pinning file containing the dependencies of keystone which are present in jessie-backports... 
[11:32:10] (03PS3) 10Arturo Borrero Gonzalez: openstack: keystone: use install_options to install from jessie-backports [puppet] - 10https://gerrit.wikimedia.org/r/439589 (https://phabricator.wikimedia.org/T196633) [11:39:35] RECOVERY - Check Varnish expiry mailbox lag on cp3046 is OK: OK: expiry mailbox lag is 131634 [11:40:09] (03CR) 10Arturo Borrero Gonzalez: [C: 032] "According to the compiler, this should be fine:" [puppet] - 10https://gerrit.wikimedia.org/r/439589 (https://phabricator.wikimedia.org/T196633) (owner: 10Arturo Borrero Gonzalez) [11:46:49] (03PS1) 10Arturo Borrero Gonzalez: openstack: keystone: fix syntax for install_options [puppet] - 10https://gerrit.wikimedia.org/r/439884 (https://phabricator.wikimedia.org/T196633) [11:47:27] (03CR) 10jerkins-bot: [V: 04-1] openstack: keystone: fix syntax for install_options [puppet] - 10https://gerrit.wikimedia.org/r/439884 (https://phabricator.wikimedia.org/T196633) (owner: 10Arturo Borrero Gonzalez) [11:47:48] !log updated component/cassandra311 on apt.wikimedia.org to 3.11.2 [11:47:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:49:57] (03PS2) 10Arturo Borrero Gonzalez: openstack: keystone: fix syntax for install_options [puppet] - 10https://gerrit.wikimedia.org/r/439884 (https://phabricator.wikimedia.org/T196633) [11:50:04] 10Operations, 10Cassandra, 10User-Eevans: Add Cassandra 3.11.2 package to internal APT repository - https://phabricator.wikimedia.org/T196745#4275154 (10MoritzMuehlenhoff) 05Open>03Resolved Imported via Secure Apt (release key is signed by Eric with whom I've signed keys) and added to component/cassandra... 
[11:50:38] (03CR) 10jerkins-bot: [V: 04-1] openstack: keystone: fix syntax for install_options [puppet] - 10https://gerrit.wikimedia.org/r/439884 (https://phabricator.wikimedia.org/T196633) (owner: 10Arturo Borrero Gonzalez) [11:52:32] (03PS3) 10Arturo Borrero Gonzalez: openstack: keystone: fix syntax for install_options [puppet] - 10https://gerrit.wikimedia.org/r/439884 (https://phabricator.wikimedia.org/T196633) [11:53:44] (03CR) 10Arturo Borrero Gonzalez: [C: 032] "Puppet compiler is rather good:" [puppet] - 10https://gerrit.wikimedia.org/r/439884 (https://phabricator.wikimedia.org/T196633) (owner: 10Arturo Borrero Gonzalez) [11:54:45] PROBLEM - rsyslog TLS listener on port 6514 on lithium is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer [11:54:56] RECOVERY - rsyslog TLS listener on port 6514 on lithium is OK: SSL OK - Certificate lithium.eqiad.wmnet valid until 2021-10-23 19:09:29 +0000 (expires in 1229 days) [11:57:23] (03CR) 10Thiemo Kreuz (WMDE): [C: 031] Enable FileExporter and FileImporter on group0 with test setup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/439876 (https://phabricator.wikimedia.org/T195370) (owner: 10WMDE-Fisch) [11:58:15] PROBLEM - rsyslog TLS listener on port 6514 on lithium is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [11:58:22] 10Operations, 10cloud-services-team, 10Patch-For-Review: cloud vps: disable system-wide apt pinning for OpenStack jessie hosts - https://phabricator.wikimedia.org/T196659#4275185 (10aborrero) Finally, the `-t jessie-backports` thing went really smooth. 
Puppet output: {P7248} [11:58:35] 10Operations, 10cloud-services-team, 10Patch-For-Review: cloud vps: disable system-wide apt pinning for OpenStack jessie hosts - https://phabricator.wikimedia.org/T196659#4275186 (10aborrero) 05Open>03Resolved a:03aborrero [11:58:45] PROBLEM - Check systemd state on db1068 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:58:47] (03CR) 10Thiemo Kreuz (WMDE): [C: 031] Allow setting of export target for FileExporter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/439875 (https://phabricator.wikimedia.org/T195370) (owner: 10WMDE-Fisch) [11:59:57] 10Operations, 10DBA, 10Gerrit, 10Phabricator: Massive increase of writes in m3 section - https://phabricator.wikimedia.org/T196840#4275191 (10Paladox) phabricator is going to parse the existing refs/changes/*/*/meta commits (no new ones will be added to the queue so this will eventually go down). According... [12:00:26] RECOVERY - puppet last run on labcontrol1003 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [12:01:12] 10Operations, 10ops-eqiad: ms-be1036 in power off status, not responsive to power on commands - https://phabricator.wikimedia.org/T196873#4275206 (10fgiunchedi) p:05Normal>03High Thanks @Cmjohnson ! Please treat this with urgency, do you know if there's an ETA? If more than a couple of days I'll remove the... [12:01:42] moritzm: all VMs rebooted (once more). I think (hope actually) we are finally OK [12:01:51] volans: yeah it was as moritzm pointed out [12:02:01] np [12:02:03] thx [12:02:32] I might break conftool btw. 
I am merging https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/439857/ [12:02:35] (03PS2) 10Alexandros Kosiaris: conftool: Add the mathoid service to kubernetes cluster [puppet] - 10https://gerrit.wikimedia.org/r/439857 [12:02:37] (03PS12) 10Alexandros Kosiaris: mathoid: Use the kubernetes LVS cluster explictly [puppet] - 10https://gerrit.wikimedia.org/r/437254 [12:02:39] (03PS8) 10Alexandros Kosiaris: lvs: Add the proton lvs configuration [puppet] - 10https://gerrit.wikimedia.org/r/437997 (https://phabricator.wikimedia.org/T186748) [12:03:42] (03CR) 10Alexandros Kosiaris: [C: 032] conftool: Add the mathoid service to kubernetes cluster [puppet] - 10https://gerrit.wikimedia.org/r/439857 (owner: 10Alexandros Kosiaris) [12:03:46] RECOVERY - rsyslog TLS listener on port 6514 on lithium is OK: SSL OK - Certificate lithium.eqiad.wmnet valid until 2021-10-23 19:09:29 +0000 (expires in 1229 days) [12:05:13] akosiaris: thanks! I've just doublechecked via cumin; all ganeti instances are running an IBPB-enabled kernel [12:05:47] !log akosiaris@puppetmaster1001 conftool action : set/weight=10; selector: dc=.*,service=mathoid,cluster=kubernetes,name=.* [12:05:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:06:02] yay [12:09:18] (03PS1) 10Marostegui: db-eqiad.php: Depool db1099:3311 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/439887 (https://phabricator.wikimedia.org/T191316) [12:10:51] 10Operations, 10monitoring, 10Patch-For-Review, 10User-fgiunchedi: Encrypt syslog traffic - https://phabricator.wikimedia.org/T136312#4275228 (10fgiunchedi) Latest rsyslog release containing the fix is already packaged in Debian unstable, it'd be easier to backport that to stretch instead of jessie. Once w... 
[12:11:07] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1099:3311 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/439887 (https://phabricator.wikimedia.org/T191316) (owner: 10Marostegui) [12:11:09] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1099:3311 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/439887 (https://phabricator.wikimedia.org/T191316) (owner: 10Marostegui) [12:11:28] !log Deploy schema change on dbstore1002:s1 T191316 T192926 T89737 T195193 [12:11:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:35] T89737: Make several mediawiki table fields unsigned ints on wmf databases - https://phabricator.wikimedia.org/T89737 [12:11:35] T192926: Schema change to drop archive.ar_text and archive.ar_flags - https://phabricator.wikimedia.org/T192926 [12:11:35] T195193: Schema change for ct_tag_id field to change_tag - https://phabricator.wikimedia.org/T195193 [12:11:35] T191316: Schema change to make archive.ar_rev_id NOT NULL - https://phabricator.wikimedia.org/T191316 [12:12:45] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1099:3311 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/439887 (https://phabricator.wikimedia.org/T191316) (owner: 10Marostegui) [12:13:02] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1099:3311 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/439887 (https://phabricator.wikimedia.org/T191316) (owner: 10Marostegui) [12:14:00] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1099:3311 for alter table (duration: 00m 52s) [12:14:02] !log Deploy schema change on db1099:3311 T191316 T192926 T89737 T195193 [12:14:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:14:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:17:15] (03PS1) 10Arturo Borrero Gonzalez: openstack: keystone: also install python-routes from jessie-backports [puppet] - 10https://gerrit.wikimedia.org/r/439888 
(https://phabricator.wikimedia.org/T196633) [12:17:52] (03CR) 10jerkins-bot: [V: 04-1] openstack: keystone: also install python-routes from jessie-backports [puppet] - 10https://gerrit.wikimedia.org/r/439888 (https://phabricator.wikimedia.org/T196633) (owner: 10Arturo Borrero Gonzalez) [12:18:59] (03PS2) 10Arturo Borrero Gonzalez: openstack: keystone: also install python-routes from jessie-backports [puppet] - 10https://gerrit.wikimedia.org/r/439888 (https://phabricator.wikimedia.org/T196633) [12:19:53] (03CR) 10Arturo Borrero Gonzalez: [C: 032] openstack: keystone: also install python-routes from jessie-backports [puppet] - 10https://gerrit.wikimedia.org/r/439888 (https://phabricator.wikimedia.org/T196633) (owner: 10Arturo Borrero Gonzalez) [12:22:05] PROBLEM - mailman list info on fermium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:22:25] (03PS1) 10Paladox: Copy wikimedia-polygerrit-style.html to static/gerrit-theme.html [software/gerrit] (deploy/wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/439889 [12:22:44] (03PS2) 10Paladox: Copy wikimedia-polygerrit-style.html to static/gerrit-theme.html [software/gerrit] (deploy/wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/439889 [12:22:45] PROBLEM - Check systemd state on restbase-dev1006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [12:23:04] (03CR) 10Paladox: "This change is ready for review." [software/gerrit] (deploy/wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/439889 (owner: 10Paladox) [12:23:05] RECOVERY - mailman list info on fermium is OK: HTTP OK: HTTP/1.1 200 OK - 15502 bytes in 3.212 second response time [12:24:07] I still cannot access mailman, can you? 
[12:24:48] (03PS1) 10Paladox: Copy GerritSite.css and GerritSiteHeader.html from puppet repo [software/gerrit] (deploy/wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/439890 [12:25:24] (03PS2) 10Paladox: Copy GerritSite.css and GerritSiteHeader.html from puppet repo [software/gerrit] (deploy/wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/439890 [12:25:39] (03CR) 10Paladox: "This change is ready for review." [software/gerrit] (deploy/wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/439890 (owner: 10Paladox) [12:26:37] it worked finally [12:26:50] maybe it got overloaded after starting? [12:27:21] oh, actually it wasn't restarted, so it is something else [12:28:11] spikes of load in the last 3 days [12:28:37] lots of listinfo processes [12:28:45] I will check for a ticket and file one CC herron akosiaris [12:29:05] i.e. /var/lib/mailman/scripts/driver listinfo [12:29:17] I thought it was a host restart [12:29:22] that is I wasn't too worried [12:29:41] uptime is five days [12:29:51] yeah, I notice that only recently [12:29:56] *ced [12:30:27] it is again unavailable to me [12:31:05] PROBLEM - mailman list info on fermium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:31:33] per prometheus there was a similar spike (also load of 120) yesterday at 5:30 [12:31:57] yes, and 2 and 3 days ago [12:32:06] RECOVERY - mailman list info on fermium is OK: HTTP OK: HTTP/1.1 200 OK - 15502 bytes in 5.301 second response time [12:32:45] I saw no ongoing ticket, will create one [12:33:39] (03PS1) 10Arturo Borrero Gonzalez: hieradata: openstack: eqiad1: actually use a false value for keystone daemon [puppet] - 10https://gerrit.wikimedia.org/r/439891 (https://phabricator.wikimedia.org/T196633) [12:33:55] RECOVERY - Check systemd state on restbase-dev1006 is OK: OK - running: The system is fully operational [12:34:06] <_joe_> !log repooling mw1230 after reimaging T196881 [12:34:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[12:34:11] T196881: Replace disk on mw1230 - https://phabricator.wikimedia.org/T196881 [12:34:21] hmm maybe spam [12:34:29] 10Operations, 10ops-eqiad, 10DC-Ops: Replace disk on mw1230 - https://phabricator.wikimedia.org/T196881#4275315 (10Joe) 05Open>03Resolved a:03Joe [12:36:05] 10Operations, 10Mail, 10Wikimedia-Mailing-lists: mailman listing unresponsive (fermium high latency) - https://phabricator.wikimedia.org/T196989#4275321 (10jcrespo) [12:36:21] nothing odd in mailman logs AFAICT (they're fairly noisy as plenty of (abandoned?) lists are repeatedly logged [12:36:39] 10Operations, 10Mail, 10Wikimedia-Mailing-lists: mailman listing unresponsive (fermium high latency) - https://phabricator.wikimedia.org/T196989#4275331 (10jcrespo) [12:36:41] https://grafana.wikimedia.org/dashboard/db/mail?refresh=5m&orgId=1&from=now-24h&to=now doesn't point to any spam spike [12:37:10] (03CR) 10Arturo Borrero Gonzalez: [C: 032] "Puppet compiler is good:" [puppet] - 10https://gerrit.wikimedia.org/r/439891 (https://phabricator.wikimedia.org/T196633) (owner: 10Arturo Borrero Gonzalez) [12:37:17] what is that listinfo thing ? [12:37:27] I think the first thing is to know if http requests hanging is a cause or a consequence [12:37:59] 10Operations, 10Mail, 10Wikimedia-Mailing-lists: mailman listing unresponsive (fermium high latency) - https://phabricator.wikimedia.org/T196989#4275334 (10MoritzMuehlenhoff) Load was in the 120 ballpark and there were total of 141 "/usr/bin/python -S /var/lib/mailman/scripts/driver listinfo" processes running. [12:38:32] this last time seems more sustained [12:38:38] !log cp3035: restart varnish-be, mbox lag [12:38:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:40:12] seems to be CGI which does "Produce listinfo page, primary web entry-point to mailing lists" [12:40:29] so e.g. 
http://lists.wikimedia.org/mailman/listinfo/betacluster-alerts would call it I guess [12:40:59] I think I have the culprit [12:41:26] please share, or fix it first and then share :-) [12:41:45] PROBLEM - mailman archives on fermium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:42:55] RECOVERY - mailman archives on fermium is OK: HTTP OK: HTTP/1.1 200 OK - 73975 bytes in 8.734 second response time [12:43:02] 10Operations, 10Mail, 10Wikimedia-Mailing-lists: mailman listing unresponsive (fermium high latency) - https://phabricator.wikimedia.org/T196989#4275349 (10jcrespo) [12:43:25] I 've banned a very specific IP [12:43:25] PROBLEM - mailman list info on fermium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:44:10] doesn't look like it helped though [12:45:25] I 've stopped apache and everything has subsided ... [12:45:32] so this is HTTP requests related [12:45:46] RECOVERY - Check Varnish expiry mailbox lag on cp3035 is OK: OK: expiry mailbox lag is 0 [12:45:46] (03PS1) 10Arturo Borrero Gonzalez: openstack: base: keystone service requires false as boolean [puppet] - 10https://gerrit.wikimedia.org/r/439892 (https://phabricator.wikimedia.org/T196633) [12:45:53] but for how long? 
[12:46:02] if mailman overloads itself [12:46:03] probably not for long [12:46:16] so this is not mailman overloading itself [12:46:20] it's someone external overloading it [12:46:24] and I 've already banned an IP [12:46:27] (03PS1) 10Giuseppe Lavagetto: mediawiki: add vhost define [puppet] - 10https://gerrit.wikimedia.org/r/439893 (https://phabricator.wikimedia.org/T196968) [12:46:29] (03PS1) 10Giuseppe Lavagetto: mediawiki::web::beta_sites: convert wikibooks to vhost [puppet] - 10https://gerrit.wikimedia.org/r/439894 (https://phabricator.wikimedia.org/T196968) [12:46:36] RECOVERY - mailman list info on fermium is OK: HTTP OK: HTTP/1.1 200 OK - 15500 bytes in 0.102 second response time [12:46:59] (03CR) 10Arturo Borrero Gonzalez: [C: 032] openstack: base: keystone service requires false as boolean [puppet] - 10https://gerrit.wikimedia.org/r/439892 (https://phabricator.wikimedia.org/T196633) (owner: 10Arturo Borrero Gonzalez) [12:47:13] yeah the listinfo process have subsided very much [12:47:20] I see only like a few now [12:47:26] <_joe_> elukey, Krenair https://gerrit.wikimedia.org/r/439893 and the followup, I'd like your opinion [12:47:39] <_joe_> basically my idea is to convert all sites to use that define [12:49:41] that IP you dropped made nearly 1300 requests today, maybe that fixed it, but the backlog is so large that we hadn't seen recovering effects yet [12:50:04] I will leave the topic [12:50:16] and the ticket for longer term analysis [12:50:22] 10Operations, 10Mail, 10Wikimedia-Mailing-lists: mailman listing unresponsive (fermium high latency) - https://phabricator.wikimedia.org/T196989#4275360 (10akosiaris) I 've banned a specific IP (I 'll share it in a private paste later on), restarted apache and everything seems to be ok now [12:50:25] is herron mostly working on email? [12:50:56] as in, is he the right person to take that or someone else? [12:51:47] I see the load going back up again [12:51:49] _joe_ seems a nice idea! 
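Note on the per-IP figure quoted at 12:49:41 ("nearly 1300 requests today"): a minimal sketch of how such a count could be obtained from a web server access log. It assumes the client address is the first whitespace-separated field of each line, as in Apache's common and combined log formats. The function name `top_clients` and the sample lines are hypothetical and use documentation-range addresses; none of this is taken from fermium.

```python
from collections import Counter


def top_clients(log_lines, n=5):
    """Return the n client addresses with the most requests.

    Assumption: the client address is the first whitespace-separated
    field of every log line (Apache common/combined log format).
    """
    counts = Counter()
    for line in log_lines:
        fields = line.split(None, 1)
        if fields:  # skip blank lines
            counts[fields[0]] += 1
    return counts.most_common(n)


# Hypothetical sample data, not real traffic.
sample = [
    '192.0.2.10 - - [12/Jun/2018:12:40:01 +0000] "GET /mailman/listinfo HTTP/1.1" 200 15502',
    '192.0.2.10 - - [12/Jun/2018:12:40:02 +0000] "GET /mailman/listinfo HTTP/1.1" 200 15502',
    '198.51.100.7 - - [12/Jun/2018:12:40:03 +0000] "GET /mailman/listinfo HTTP/1.1" 200 15502',
    '192.0.2.10 - - [12/Jun/2018:12:40:04 +0000] "GET /mailman/listinfo HTTP/1.1" 200 15502',
]

for address, hits in top_clients(sample):
    print(address, hits)
```

A count like this only shows who sent the most requests. It does not by itself answer the question raised at 12:37:27, whether the hanging HTTP requests are a cause or a consequence of the load.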
[12:52:31] 10Operations, 10Mail, 10Wikimedia-Mailing-lists: mailman listing unresponsive (fermium high latency) - https://phabricator.wikimedia.org/T196989#4275364 (10akosiaris) P7249 for the list of IPs [12:53:28] yeah, it is going to fail again [12:53:55] PROBLEM - Check Varnish expiry mailbox lag on cp3039 is CRITICAL: CRITICAL: expiry mailbox lag is 2065091 [12:54:06] _joe_ assuming of course that the vhosts will have the same structure in the future (I think this is the case since they haven't checked a lot) [12:54:13] but +1 from me, no concerns [12:54:23] I also like the clarity of the define in the puppet config [12:54:42] I was wondering if mod_macro could have been used instead but probably too messy [12:57:47] (03PS1) 10Paladox: Planet: Set xmlmaxarticles to 100 in rawdog config [puppet] - 10https://gerrit.wikimedia.org/r/439897 [12:58:19] (03PS2) 10Paladox: Planet: Set xmlmaxarticles to 100 in rawdog config [puppet] - 10https://gerrit.wikimedia.org/r/439897 [12:58:37] (03PS3) 10Paladox: Planet: Set xmlmaxarticles to 100 in rawdog config [puppet] - 10https://gerrit.wikimedia.org/r/439897 (https://phabricator.wikimedia.org/T196965) [12:59:29] (03PS4) 10Paladox: Planet: Set xmlmaxarticles to 100 in config [puppet] - 10https://gerrit.wikimedia.org/r/439897 (https://phabricator.wikimedia.org/T196965) [13:00:05] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor My software never has bugs. It just develops random features. Rise for European Mid-day SWAT(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180612T1300). [13:00:05] No GERRIT patches in the queue for this window AFAICS. [13:00:16] 10Operations, 10monitoring, 10User-fgiunchedi: Open Phab tasks on SMART failure - https://phabricator.wikimedia.org/T196994#4275410 (10fgiunchedi) p:05Triage>03Normal [13:00:18] (03CR) 10Paladox: "This change is ready for review." 
[puppet] - 10https://gerrit.wikimedia.org/r/439897 (https://phabricator.wikimedia.org/T196965) (owner: 10Paladox) [13:00:28] nice, no patches, no swat ;) [13:00:50] (03CR) 10Dzahn: [C: 032] Planet: Set xmlmaxarticles to 100 in config [puppet] - 10https://gerrit.wikimedia.org/r/439897 (https://phabricator.wikimedia.org/T196965) (owner: 10Paladox) [13:04:04] zeljkof: woo no swat patches [13:04:06] CFisch_remote: around? [13:04:15] jouncebot: next [13:04:15] In 0 hour(s) and 55 minute(s): FileImporter and FileExporter in group0 (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180612T1400) [13:04:29] addshore: jepp [13:04:49] but a bit distracted [13:04:54] zeljkof: I'll start my next window now then as there is nothing in swat, and the first patch requires a fill sync, (YAY) [13:05:08] addshore: go ahead :D [13:05:09] CFisch_remote: is the patch on the branch ready? :) [13:05:29] nope I wanted to prepare that just before 4pm [13:05:34] but we can do it now [13:05:40] ack :) [13:07:32] https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/FileExporter/+/439900/ [13:07:42] (03CR) 10Hashar: [C: 031] Gerrit: Make PolyGerrit the default ui (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/439444 (https://phabricator.wikimedia.org/T196812) (owner: 10Paladox) [13:08:15] RECOVERY - mediawiki-installation DSH group on mw1230 is OK: OK [13:08:43] (03CR) 10Vgutierrez: [C: 031] "nitpick & inline doubt, but it's looking good :D" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/417948 (owner: 10Giuseppe Lavagetto) [13:09:53] (03CR) 10Paladox: Gerrit: Make PolyGerrit the default ui (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/439444 (https://phabricator.wikimedia.org/T196812) (owner: 10Paladox) [13:11:24] addshore: https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/FileExporter/+/439900/ [13:12:29] 10Operations, 10ops-eqiad: ms-be1036 in power off status, not responsive to power on commands - 
https://phabricator.wikimedia.org/T196873#4275456 (10Cmjohnson) @fgiunchedi I submitted a ticket with HP. I recommend removing the server from swift until it's fixed since I do not know what it's going to take to f... [13:12:51] (03PS3) 10Filippo Giunchedi: prometheus: use validate_cmd for rules and config files [puppet] - 10https://gerrit.wikimedia.org/r/432074 [13:13:35] (03PS6) 10Elukey: Move the varnishkafka submodule to operations/puppet [puppet] - 10https://gerrit.wikimedia.org/r/437467 (https://phabricator.wikimedia.org/T188377) [13:13:37] (03PS2) 10Elukey: Move the kafkatee submodule to operations/puppet [puppet] - 10https://gerrit.wikimedia.org/r/437950 (https://phabricator.wikimedia.org/T188377) [13:13:39] (03PS2) 10Elukey: Move the jmxtrans submodule to operations/puppet [puppet] - 10https://gerrit.wikimedia.org/r/437951 (https://phabricator.wikimedia.org/T188377) [13:13:41] (03PS1) 10Elukey: Move the nginx submodule to operations/puppet [puppet] - 10https://gerrit.wikimedia.org/r/439901 (https://phabricator.wikimedia.org/T188377) [13:14:03] nope --^ didn't work [13:14:03] (03CR) 10Filippo Giunchedi: [C: 032] prometheus: use validate_cmd for rules and config files [puppet] - 10https://gerrit.wikimedia.org/r/432074 (owner: 10Filippo Giunchedi) [13:15:02] (03CR) 10jerkins-bot: [V: 04-1] Move the nginx submodule to operations/puppet [puppet] - 10https://gerrit.wikimedia.org/r/439901 (https://phabricator.wikimedia.org/T188377) (owner: 10Elukey) [13:15:33] 10Operations, 10JADE, 10Scoring-platform-team, 10User-Joe: Scalability concerns creating a page per revision - https://phabricator.wikimedia.org/T196547#4275468 (10awight) In the per-page schema proposed above, the page-revision index would grow at the scary rate, up to one index entry per revision added t... 
[13:16:09] !log installing openjdk-8 security updates on restbase-dev along with cassandra restarts [13:16:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:17:24] (03Abandoned) 10Elukey: Move the nginx submodule to operations/puppet [puppet] - 10https://gerrit.wikimedia.org/r/439901 (https://phabricator.wikimedia.org/T188377) (owner: 10Elukey) [13:21:29] CFisch_remote: cool! [13:21:57] (03PS3) 10Filippo Giunchedi: prometheus: alert on config reload failure [puppet] - 10https://gerrit.wikimedia.org/r/432059 [13:24:18] mhh fermium still with its cpu pegged, taking a look [13:24:44] (03PS2) 10Ema: varnish: Remove setting of CP cookies [puppet] - 10https://gerrit.wikimedia.org/r/437774 (https://phabricator.wikimedia.org/T110353) (owner: 10Krinkle) [13:24:45] CFisch_remote: apparently my internet is gone... [13:24:46] godog: https://phabricator.wikimedia.org/T196989#4275364 [13:25:05] Just waiting for it to come back.. [13:25:52] (03CR) 10Zhuyifei1999: "@Dvorapa Please don't bother rebasing patches in ops/puppet, unless it cannot be auto-rebased (conflict). It will be rebased by the person" [puppet] - 10https://gerrit.wikimedia.org/r/416874 (https://phabricator.wikimedia.org/T189052) (owner: 10Zhuyifei1999) [13:26:23] addshore: ^^' [13:26:27] (03CR) 10Ema: [C: 032] varnish: Remove setting of CP cookies [puppet] - 10https://gerrit.wikimedia.org/r/437774 (https://phabricator.wikimedia.org/T110353) (owner: 10Krinkle) [13:26:43] right, merging https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/FileExporter/+/439900/ on the .7 branch [13:26:47] *waits for CI* [13:28:01] (03CR) 10Dvorapa: "> @Dvorapa Please don't bother rebasing patches in ops/puppet, unless" [puppet] - 10https://gerrit.wikimedia.org/r/416874 (https://phabricator.wikimedia.org/T189052) (owner: 10Zhuyifei1999) [13:28:44] akosiaris: thanks! 
yeah looks like more offenders, load at 100+ [13:29:14] (03CR) 10Dvorapa: "Also sorry for some unrelated test accounts, I've overclicked" [puppet] - 10https://gerrit.wikimedia.org/r/416874 (https://phabricator.wikimedia.org/T189052) (owner: 10Zhuyifei1999) [13:30:06] PROBLEM - proton endpoints health on proton2001 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) is CRITICAL: Test Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices returned the unexpected status 503 (expecting: 200) [13:30:42] (03PS1) 10Gehel: maps: upgrade to cassandra-2.2.6-wmf5 [puppet] - 10https://gerrit.wikimedia.org/r/439905 (https://phabricator.wikimedia.org/T196044) [13:31:15] RECOVERY - proton endpoints health on proton2001 is OK: All endpoints are healthy [13:31:49] (03CR) 10Gehel: [C: 032] maps: upgrade to cassandra-2.2.6-wmf5 [puppet] - 10https://gerrit.wikimedia.org/r/439905 (https://phabricator.wikimedia.org/T196044) (owner: 10Gehel) [13:32:46] Hi, can you deploy https://gerrit.wikimedia.org/r/#/c/436211/ [13:32:47] in this case it'd be also nice if we could ask mod_cgi to always limits its concurrency heh [13:32:47] Thanks! 
[13:33:00] CFisch_remote: looks merged to me [13:33:54] addshore: lets assume its merged then ;-) [13:33:56] (03PS5) 10Zoranzoki21: Add sites to the wgCopyUploadsDomains whitelist of Wikimedia Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436211 (https://phabricator.wikimedia.org/T195270) [13:36:04] CFisch_remote: right, pulled onto tin, and now pulled onto mwdebug1002 [13:36:10] *checks nothing is somehow broken* [13:37:54] I mean in theory nothing of this should be loaded atm [13:37:54] (03PS2) 10Jcrespo: [WIP] Add replication managing [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/439871 [13:37:57] 10Operations, 10Cassandra, 10Discovery, 10Maps, 10Patch-For-Review: cassandra 2.2.6-wmf4 is not compatible with python 2.7.13 (debian stretch) - https://phabricator.wikimedia.org/T196044#4275526 (10Gehel) cassandra-2.2.6-wmf5 deployed on maps-test2004, it seems to work just fine. [13:37:59] but you never know [13:38:22] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Add replication managing [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/439871 (owner: 10Jcrespo) [13:38:31] !log addshore@deploy1001 Started scap: [[gerrit:439900|FileExporter backport]] - Pre deployment backport (extension not yet deployed) [13:38:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:38:37] CFisch_remote: ^^ [13:39:09] affirmative [13:39:11] addshore: Sorry, can you deploy https://gerrit.wikimedia.org/r/#/c/436211/? [13:39:21] (03PS4) 10Hoo man: Support prefixed dump types [puppet] - 10https://gerrit.wikimedia.org/r/424291 (https://phabricator.wikimedia.org/T163328) (owner: 10Lokal Profil) [13:39:37] 10Operations, 10Deployments, 10Patch-For-Review, 10Release-Engineering-Team (Kanban), 10User-Addshore: cannot delete non-empty directory: php-1.29.0-wmf.3 messages on 'scap sync' on mwdebug1002 - https://phabricator.wikimedia.org/T157030#4275527 (10Addshore) Just got this while syncing: ``` 13:38:32 Sta... 
[13:40:07] sorry Zoranzoki21, as it wasn't in the calendar I have started something else, and the current sync will take ~45 mins [13:40:29] addshore: Ok, I can add for next swat? [13:40:33] yup [13:40:40] (03CR) 10ArielGlenn: [C: 032] Support prefixed dump types [puppet] - 10https://gerrit.wikimedia.org/r/424291 (https://phabricator.wikimedia.org/T163328) (owner: 10Lokal Profil) [13:41:00] addshore: tnx [13:47:27] 10Operations, 10Mail, 10Wikimedia-Mailing-lists: mailman listing unresponsive (fermium high latency) - https://phabricator.wikimedia.org/T196989#4275321 (10fgiunchedi) Looks like high load is back with a whole lot of `listinfo` requests [13:49:27] (03PS3) 10Jcrespo: [WIP] Add replication managing [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/439871 [13:49:54] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Add replication managing [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/439871 (owner: 10Jcrespo) [13:54:04] (03PS1) 10Volans: Drop lsb_release dependency [software/debmonitor] - 10https://gerrit.wikimedia.org/r/439910 (https://phabricator.wikimedia.org/T191300) [13:54:16] (03PS1) 10Ema: vcl: avoid consistent hashing for pipe traffic [puppet] - 10https://gerrit.wikimedia.org/r/439911 (https://phabricator.wikimedia.org/T196553) [13:55:14] (03PS2) 10Volans: Client CLI: drop lsb_release dependency [software/debmonitor] - 10https://gerrit.wikimedia.org/r/439910 (https://phabricator.wikimedia.org/T191300) [13:55:59] (03CR) 10BBlack: [C: 031] vcl: avoid consistent hashing for pipe traffic [puppet] - 10https://gerrit.wikimedia.org/r/439911 (https://phabricator.wikimedia.org/T196553) (owner: 10Ema) [13:56:23] (03CR) 10jerkins-bot: [V: 04-1] Client CLI: drop lsb_release dependency [software/debmonitor] - 10https://gerrit.wikimedia.org/r/439910 (https://phabricator.wikimedia.org/T191300) (owner: 10Volans) [13:58:49] (03PS1) 10Herron: mailman: add per IP rate limit of 50 requests per 5 min [puppet] - 10https://gerrit.wikimedia.org/r/439912 
(https://phabricator.wikimedia.org/T196989) [14:00:04] addshore and CFisch_WMDE: Dear deployers, time to do the FileImporter and FileExporter in group0 deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180612T1400). [14:00:26] O/ [14:00:36] (03PS2) 10Ema: vcl: avoid consistent hashing for pipe traffic [puppet] - 10https://gerrit.wikimedia.org/r/439911 (https://phabricator.wikimedia.org/T196553) [14:00:52] CFisch_remote: internet just dropped again... [14:01:00] oh man [14:01:02] Or, DNS did. Mhmpf [14:01:25] at least you do not need to have a connection all the time for things to run ^^ [14:01:32] (03PS1) 10Paladox: planet: Add labs common.yaml file to add hiera keys for labs only [puppet] - 10https://gerrit.wikimedia.org/r/439913 [14:03:08] (03PS2) 10Paladox: planet: Add labs common.yaml file to add hiera keys for labs only [puppet] - 10https://gerrit.wikimedia.org/r/439913 [14:03:35] (03PS3) 10Paladox: planet: Add labs common.yaml file to add hiera keys for labs only [puppet] - 10https://gerrit.wikimedia.org/r/439913 [14:04:06] (03CR) 10Dzahn: [C: 032] planet: Add labs common.yaml file to add hiera keys for labs only [puppet] - 10https://gerrit.wikimedia.org/r/439913 (owner: 10Paladox) [14:04:19] (03PS4) 10Jcrespo: [WIP] Add replication managing [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/439871 [14:04:34] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Add replication managing [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/439871 (owner: 10Jcrespo) [14:06:17] (03PS1) 10Paladox: planet: Add meta link to labs hiera value [puppet] - 10https://gerrit.wikimedia.org/r/439914 [14:06:25] CFisch_remote: yup, woo for screen! 
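Note on the patch posted at 13:58:49 ("mailman: add per IP rate limit of 50 requests per 5 min"): a minimal sketch of what a limit of that shape means, written as a sliding-window counter. This is an illustration only; the log does not show how the puppet change implements the limit, and the class name `SlidingWindowLimiter` and its parameters are hypothetical.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most max_requests per client within window_seconds.

    Defaults mirror the figures in the patch title (50 requests per
    5 minutes); the mechanism itself is an assumption.
    """

    def __init__(self, max_requests=50, window_seconds=300):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # client address -> request timestamps

    def allow(self, client_ip, now):
        hits = self._hits[client_ip]
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False  # over the limit: reject, do not record
        hits.append(now)
        return True


limiter = SlidingWindowLimiter()
# One request per second from a single (documentation-range) address.
decisions = [limiter.allow("192.0.2.10", t) for t in range(51)]
print(decisions.count(True), decisions.count(False))
```

Rejected requests are not recorded here, so a client that keeps hammering is let back in as soon as its oldest accepted request leaves the window. Recording rejected requests too would be the stricter choice.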
[14:06:34] (03PS3) 10Ema: vcl: avoid consistent hashing for pipe traffic [puppet] - 10https://gerrit.wikimedia.org/r/439911 (https://phabricator.wikimedia.org/T196553) [14:06:36] (03PS2) 10Paladox: planet: Add meta link to labs hiera value [puppet] - 10https://gerrit.wikimedia.org/r/439914 [14:06:44] (03CR) 10Paladox: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/439914 (owner: 10Paladox) [14:07:58] 10Operations, 10Mail, 10Wikimedia-Mailing-lists, 10Patch-For-Review: mailman listing unresponsive (fermium high latency) - https://phabricator.wikimedia.org/T196989#4275640 (10akosiaris) Yeah, found something new, I 've reblocked some stuff, I 'll update P7249. Things do look normal again, this might just... [14:08:12] (03CR) 10Dzahn: [C: 032] planet: Add meta link to labs hiera value [puppet] - 10https://gerrit.wikimedia.org/r/439914 (owner: 10Paladox) [14:09:09] !log addshore@deploy1001 Finished scap: [[gerrit:439900|FileExporter backport]] - Pre deployment backport (extension not yet deployed) (duration: 30m 37s) [14:09:13] CFisch_remote: ^^ [14:09:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:18] only 30 mins now, woo [14:09:33] right, now onto the config I understand CFisch_remote ? [14:10:07] yes so next the config [14:10:15] the upper one first [14:10:21] and the the other [14:10:52] the upper one? :P [14:10:53] the last one is where it get's interesting and things can explode :-D [14:10:59] on the calendar ?:) [14:11:04] yep [14:11:33] or https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/439875/ to be precise [14:13:00] any reason you chose to put the test wikis that way around? (test2 being a source and test being a target)? 
just curious :D [14:13:53] addshore: when looking on the upload pages you will see that test2 has a super big warning to not upload things there [14:14:03] ooooh, cool [14:14:15] (03PS6) 10Paladox: Gerrit: Add support for adding additional domains to alias in apache [puppet] - 10https://gerrit.wikimedia.org/r/439783 [14:14:15] ( but still the form is shown and people do it ) [14:14:26] so we thought it might be better to have it that way [14:14:27] (03PS2) 10Addshore: Allow setting of export target for FileExporter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/439875 (https://phabricator.wikimedia.org/T195370) (owner: 10WMDE-Fisch) [14:14:33] (03PS3) 10Addshore: Enable FileExporter and FileImporter on group0 with test setup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/439876 (https://phabricator.wikimedia.org/T195370) (owner: 10WMDE-Fisch) [14:15:01] sounds like a good reason to me [14:15:29] (03PS3) 10Paladox: Gerrit: Hook up gerrit.wmfusercontent.org to alias [puppet] - 10https://gerrit.wikimedia.org/r/439808 [14:15:31] (03CR) 10Addshore: [C: 032] Allow setting of export target for FileExporter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/439875 (https://phabricator.wikimedia.org/T195370) (owner: 10WMDE-Fisch) [14:16:17] (03PS4) 10Ema: vcl: avoid consistent hashing for pipe traffic [puppet] - 10https://gerrit.wikimedia.org/r/439911 (https://phabricator.wikimedia.org/T196553) [14:16:41] (03PS1) 10Herron: mailman: perform rbl checks on listinfo requests [puppet] - 10https://gerrit.wikimedia.org/r/439915 (https://phabricator.wikimedia.org/T196989) [14:16:51] (03CR) 10Ema: [C: 032] vcl: avoid consistent hashing for pipe traffic [puppet] - 10https://gerrit.wikimedia.org/r/439911 (https://phabricator.wikimedia.org/T196553) (owner: 10Ema) [14:17:02] (03Merged) 10jenkins-bot: Allow setting of export target for FileExporter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/439875 (https://phabricator.wikimedia.org/T195370) (owner: 
10WMDE-Fisch) [14:17:08] 10Operations, 10ops-eqiad: ms-be1036 in power off status, not responsive to power on commands - https://phabricator.wikimedia.org/T196873#4275648 (10fgiunchedi) @Cmjohnson ok! thanks, I'll being removing the machine from swift tomorrow [14:17:44] (03PS3) 10Volans: Client CLI: drop lsb_release dependency [software/debmonitor] - 10https://gerrit.wikimedia.org/r/439910 (https://phabricator.wikimedia.org/T191300) [14:18:21] (03CR) 10jenkins-bot: Allow setting of export target for FileExporter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/439875 (https://phabricator.wikimedia.org/T195370) (owner: 10WMDE-Fisch) [14:18:28] CFisch_remote: first patch is on mwdebug1002, *checks the world is still there* [14:18:50] (03CR) 10jerkins-bot: [V: 04-1] Client CLI: drop lsb_release dependency [software/debmonitor] - 10https://gerrit.wikimedia.org/r/439910 (https://phabricator.wikimedia.org/T191300) (owner: 10Volans) [14:19:17] :-) [14:20:10] syncing patch #1 [14:20:55] !log addshore@deploy1001 Synchronized wmf-config/CommonSettings.php: FileImporter/Exporter [[gerrit:439875|Allow setting of export target for FileExporter]] T195370 (duration: 00m 50s) [14:20:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:59] T195370: Deploy FileExporter and FileImporter to group0 - https://phabricator.wikimedia.org/T195370 [14:21:07] (03CR) 10Addshore: [C: 032] Enable FileExporter and FileImporter on group0 with test setup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/439876 (https://phabricator.wikimedia.org/T195370) (owner: 10WMDE-Fisch) [14:21:48] CFisch_remote: and in goes patch #2 [14:22:06] CFisch_remote: i guess you should be able to kind of fully test this while on mwdebug1002? 
:) [14:22:27] (03Merged) 10jenkins-bot: Enable FileExporter and FileImporter on group0 with test setup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/439876 (https://phabricator.wikimedia.org/T195370) (owner: 10WMDE-Fisch) [14:22:35] (03PS2) 10Herron: mailman: perform rbl checks on listinfo requests [puppet] - 10https://gerrit.wikimedia.org/r/439915 (https://phabricator.wikimedia.org/T196989) [14:22:54] addshore: I hope so [14:22:58] (03CR) 10jenkins-bot: Enable FileExporter and FileImporter on group0 with test setup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/439876 (https://phabricator.wikimedia.org/T195370) (owner: 10WMDE-Fisch) [14:23:01] I have the tabs open [14:23:08] CFisch_remote: it is done [14:23:13] mwdebug1002 that is [14:23:15] (03CR) 10Alexandros Kosiaris: [C: 031] mailman: add per IP rate limit of 50 requests per 5 min [puppet] - 10https://gerrit.wikimedia.org/r/439912 (https://phabricator.wikimedia.org/T196989) (owner: 10Herron) [14:23:28] you should totally update the author part too! im currently the only person listed there :O [14:23:36] PROBLEM - pdfrender on scb1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:23:41] * CFisch_remote checks [14:24:24] beta feature is there [14:24:28] link text is in [14:24:45] (03CR) 10Herron: "worth mentioning that this will add some dns lookup overhead to requests matching the REQUEST_URL. hits are cached." [puppet] - 10https://gerrit.wikimedia.org/r/439915 (https://phabricator.wikimedia.org/T196989) (owner: 10Herron) [14:24:47] (03CR) 10Filippo Giunchedi: [C: 031] mailman: add per IP rate limit of 50 requests per 5 min [puppet] - 10https://gerrit.wikimedia.org/r/439912 (https://phabricator.wikimedia.org/T196989) (owner: 10Herron) [14:25:04] CFisch_remote: https://usercontent.irccloud-cdn.com/file/qU92P8Ol/image.png [14:25:17] o.O [14:25:21] I get "File uploads are not available on this wiki. 
If you have a legitimate need to test uploading, local bureaucrats can assign you the relevant right. " [14:25:42] oooh, you dont have the ability to upload files? :P [14:25:45] let me give you a flag [14:26:03] but that's a problem [14:26:10] so users cannot really test this [14:26:10] oh noes =o [14:26:22] why do I see the upload form but then do not have the rights to upload [14:26:23] ahhh [14:26:30] that's stupid [14:26:40] damn [14:26:47] so, autoconfirmed users should be able to upload [14:26:48] hmm ... [14:27:04] you just must not be autoconfirmed on testwiki [14:27:06] (03CR) 10Herron: mailman: add per IP rate limit of 50 requests per 5 min (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/439912 (https://phabricator.wikimedia.org/T196989) (owner: 10Herron) [14:27:14] whats your username? [14:27:14] maybe [14:27:24] Christoph Jauera (WMDE) [14:27:55] https://usercontent.irccloud-cdn.com/file/Vr2WQnKY/image.png [14:27:56] RECOVERY - pdfrender on scb1004 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.002 second response time [14:28:17] there must be another right needed for uploading that isnt in the group [14:28:22] damn so it's something different [14:28:23] yeah [14:28:43] and Lea is in a meeting and I can't reach her ... [14:28:52] hard to say what we should do now [14:28:58] (03PS13) 10Alexandros Kosiaris: mathoid: Use the kubernetes LVS cluster explictly [puppet] - 10https://gerrit.wikimedia.org/r/437254 [14:29:00] (03PS9) 10Alexandros Kosiaris: lvs: Add the proton lvs configuration [puppet] - 10https://gerrit.wikimedia.org/r/437997 (https://phabricator.wikimedia.org/T186748) [14:29:21] CFisch_remote: do uselang=qqx, what is the message key for that message? 
[14:30:24] (03CR) 10Alexandros Kosiaris: [C: 032] mathoid: Use the kubernetes LVS cluster explictly [puppet] - 10https://gerrit.wikimedia.org/r/437254 (owner: 10Alexandros Kosiaris) [14:30:31] addshore: hard to say it comes after the post when I try to upload for real [14:30:38] aaah lame [14:30:47] and thats strange because we do the upload check at the beginning [14:30:47] I think thats an on wiki override *looks for it* [14:30:57] it must be triggered somewhere "inside" [14:31:13] CFisch_remote: its abusefilter [14:31:14] :D [14:31:18] https://test.wikipedia.org/w/index.php?search=File+uploads+are+not+available+on+this+wiki&title=Special:Search&profile=advanced&fulltext=1&advancedSearch-current=%7B%22namespaces%22%3A%5B%228%22%5D%7D&ns8=1&searchToken=2rnymx7bzcm0ly5x604x0pc5d [14:31:21] wtf :-D [14:31:25] nice [14:31:37] *looks at abusefilter* [14:32:00] (03PS2) 10Herron: mailman: add per IP rate limit of 50 requests per 5 min [puppet] - 10https://gerrit.wikimedia.org/r/439912 (https://phabricator.wikimedia.org/T196989) [14:32:02] we could lower that rule for the test phase then [14:32:20] CFisch_remote: https://test.wikipedia.org/w/index.php?title=Special:AbuseLog&wpSearchFilter=160 [14:32:27] CFisch_remote: https://test.wikipedia.org/wiki/Special:AbuseFilter/160 [14:32:32] requires autopatrol or reviewer [14:33:17] (03CR) 10Herron: [C: 032] mailman: add per IP rate limit of 50 requests per 5 min [puppet] - 10https://gerrit.wikimedia.org/r/439912 (https://phabricator.wikimedia.org/T196989) (owner: 10Herron) [14:33:21] (03PS3) 10Herron: mailman: add per IP rate limit of 50 requests per 5 min [puppet] - 10https://gerrit.wikimedia.org/r/439912 (https://phabricator.wikimedia.org/T196989) [14:33:55] addshore: do you have rights to change that filter - so we temporarily disable it for the group0 test phase? [14:34:01] I do, hmm [14:34:16] ( it should be 2 weeks I think ) [14:34:19] so, at what point do you hit that? 
once on the preview page and youve made changes etc? [14:34:40] (03CR) 10Dzahn: "this results in a line "ServerAlias" that isn't followed by an alias. Not sure if Apache will hate this." [puppet] - 10https://gerrit.wikimedia.org/r/439783 (owner: 10Paladox) [14:34:44] after the preview page [14:34:49] CFisch_remote: I'll disable it now so you can try again [14:34:53] when you press upload [14:35:05] (03CR) 10Ottomata: "Yeehaw, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/439911 (https://phabricator.wikimedia.org/T196553) (owner: 10Ema) [14:35:43] CFisch_remote: try again? :) [14:35:46] addshore: "This action has been automatically identified as harmful, and therefore disallowed. If you believe your action was constructive, please inform an administrator of what you were trying to do. A brief description of the abuse rule which your action matched is: Mass upload stop " [14:35:48] oh wait i failed [14:35:51] next filter ^^ [14:36:06] man that sucks :-) [14:36:10] CFisch_remote: try now :) [14:36:23] \o/ [14:36:25] worked [14:36:28] okay [14:36:35] right, im gonna do the rest of the sync then [14:36:47] phew nice, thank you so much addshore [14:37:16] syncing [14:37:47] (03CR) 10Paladox: "> Patch Set 6:" [puppet] - 10https://gerrit.wikimedia.org/r/439783 (owner: 10Paladox) [14:38:05] !log addshore@deploy1001 Synchronized wmf-config/InitialiseSettings.php: FileImporter/Exporter [[gerrit:439876|Enable FileExporter/Importer on group0 wikis]] T195370 (duration: 00m 51s) [14:38:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:10] T195370: Deploy FileExporter and FileImporter to group0 - https://phabricator.wikimedia.org/T195370 [14:38:38] CFisch_remote: in a meeting now [14:38:43] !log file exporter importer slot done [14:38:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:28] (03CR) 10Filippo Giunchedi: "See nits inline, LGTM in general. 
If we are running into problems with legitimate clients we can introduce rate limits instead of outright" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/439915 (https://phabricator.wikimedia.org/T196989) (owner: 10Herron) [14:39:38] (03CR) 10Filippo Giunchedi: [C: 031] mailman: perform rbl checks on listinfo requests [puppet] - 10https://gerrit.wikimedia.org/r/439915 (https://phabricator.wikimedia.org/T196989) (owner: 10Herron) [14:39:58] and CFisch_remote i made you an admin on test [14:40:18] CFisch_remote: so you can turn the filter back on after testing etc [14:40:25] nice, thanks again [14:40:26] yepp [14:44:36] 10Operations, 10Electron-PDFs, 10Proton, 10Readers-Web-Backlog, and 2 others: New service request: chromium-render/deploy - https://phabricator.wikimedia.org/T186748#4003039 (10akosiaris) Hello, I 've stalled adding LVS configuration for proton due to an instability we've been noticing. This instability i... [14:44:38] (03CR) 10Dzahn: [C: 04-1] "per our IRC dicussion, should be a separate vhost, not a ServerAlias" [puppet] - 10https://gerrit.wikimedia.org/r/439783 (owner: 10Paladox) [14:45:18] 10Operations, 10Move-Files-To-Commons, 10TCB-Team, 10Wikimedia-Extension-setup, and 4 others: Deploy FileExporter and FileImporter to group0 - https://phabricator.wikimedia.org/T195370#4275730 (10WMDE-Fisch) [14:49:26] PROBLEM - HHVM rendering on mw1232 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:50:25] RECOVERY - HHVM rendering on mw1232 is OK: HTTP OK: HTTP/1.1 200 OK - 76182 bytes in 0.233 second response time [14:51:52] (03PS1) 10Herron: mailman: add recently observed false UA to bad_browser check [puppet] - 10https://gerrit.wikimedia.org/r/439922 (https://phabricator.wikimedia.org/T196989) [14:53:24] (03CR) 10Alexandros Kosiaris: [C: 031] mailman: add recently observed false UA to bad_browser check [puppet] - 10https://gerrit.wikimedia.org/r/439922 (https://phabricator.wikimedia.org/T196989) (owner: 10Herron) [14:53:39] (03CR) 
10Herron: [C: 032] mailman: add recently observed false UA to bad_browser check [puppet] - 10https://gerrit.wikimedia.org/r/439922 (https://phabricator.wikimedia.org/T196989) (owner: 10Herron) [14:56:29] (03CR) 10Nuria: [C: 031] vcl: avoid consistent hashing for pipe traffic [puppet] - 10https://gerrit.wikimedia.org/r/439911 (https://phabricator.wikimedia.org/T196553) (owner: 10Ema) [14:56:44] (03PS5) 10Jcrespo: [WIP] Add replication managing [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/439871 [14:57:15] 10Operations, 10Move-Files-To-Commons, 10TCB-Team, 10Wikimedia-Extension-setup, and 4 others: Deploy FileExporter and FileImporter to group0 - https://phabricator.wikimedia.org/T195370#4224988 (10Tobi_WMDE_SW) >>! In T195370#4275730, @WMDE-Fisch wrote: > Can now be tested, e.g. on https://test2.wikipedia.o... [14:57:17] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Add replication managing [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/439871 (owner: 10Jcrespo) [14:58:03] 10Operations, 10Move-Files-To-Commons, 10TCB-Team, 10Wikimedia-Extension-setup, and 4 others: Deploy FileExporter and FileImporter to group0 - https://phabricator.wikimedia.org/T195370#4275778 (10Addshore) >>! In T195370#4275772, @Tobi_WMDE_SW wrote: >>>! In T195370#4275730, @WMDE-Fisch wrote: >> Can now b... [15:00:13] 10Operations, 10Mail, 10Wikimedia-Mailing-lists, 10Patch-For-Review: mailman listing unresponsive (fermium high latency) - https://phabricator.wikimedia.org/T196989#4275788 (10herron) General http(s) request rate limiting has been enabled for requests matching `\/mailman.*` with a threshold of 50 requests... 
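The "50 requests per 5 min" per-IP limit mentioned for the mailman pages can be illustrated with a small sliding-window counter. This is only a sketch of the policy as described in the log, not the Apache/puppet configuration that was actually deployed; the class name `SlidingWindowLimiter`, its parameters, and the example IP addresses are all hypothetical.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Hypothetical per-client limiter: at most max_requests per window_seconds."""

    def __init__(self, max_requests=50, window_seconds=300.0):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        # Maps client IP -> timestamps of its accepted requests, oldest first.
        self._hits = defaultdict(deque)

    def allow(self, client_ip, now):
        """Return True if the request at time `now` (seconds) may proceed."""
        hits = self._hits[client_ip]
        # Forget accepted requests that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            # Over the limit: a server would answer 429 Too Many Requests here.
            return False
        hits.append(now)
        return True


limiter = SlidingWindowLimiter()
results = [limiter.allow("192.0.2.1", float(t)) for t in range(51)]
print(results.count(True), results[-1])
```

A limit like this bounds how often a single address can hit the service, but as the later task comment notes, a client staying just under the threshold can still generate noticeable load, which is why caching was also discussed.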
[15:00:33] (03PS1) 10Ema: vcl: properly choose backend in vcl_pipe [puppet] - 10https://gerrit.wikimedia.org/r/439929 (https://phabricator.wikimedia.org/T196553) [15:00:36] 10Operations, 10Packaging, 10Scap, 10Patch-For-Review: Install git-lfs client (at least on scap targets & masters) - https://phabricator.wikimedia.org/T180628#4275789 (10awight) 05Open>03Resolved a:03awight [15:00:50] (03PS4) 10Volans: Client CLI: drop lsb_release dependency [software/debmonitor] - 10https://gerrit.wikimedia.org/r/439910 (https://phabricator.wikimedia.org/T191300) [15:01:47] (03CR) 10jerkins-bot: [V: 04-1] Client CLI: drop lsb_release dependency [software/debmonitor] - 10https://gerrit.wikimedia.org/r/439910 (https://phabricator.wikimedia.org/T191300) (owner: 10Volans) [15:02:15] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [15:03:23] (03PS6) 10Jcrespo: [WIP] Add replication managing [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/439871 [15:03:48] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Add replication managing [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/439871 (owner: 10Jcrespo) [15:03:49] seems upload esams having trouble [15:03:53] cc ema --^ [15:04:01] yup, cp3039 [15:04:02] thanks elukey [15:04:05] <3 [15:04:06] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=upload&var-status_type=5 [15:07:25] !log cp3039: restart varnish-backend [15:07:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:07:43] (03PS7) 10Jcrespo: [WIP] Add replication managing 
[software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/439871 [15:14:24] (03CR) 10Ema: [C: 032] vcl: properly choose backend in vcl_pipe [puppet] - 10https://gerrit.wikimedia.org/r/439929 (https://phabricator.wikimedia.org/T196553) (owner: 10Ema) [15:14:58] (03PS1) 10BBlack: esams rebalance: move cp3043 from text to upload [puppet] - 10https://gerrit.wikimedia.org/r/439936 [15:15:16] RECOVERY - Check Varnish expiry mailbox lag on cp3039 is OK: OK: expiry mailbox lag is 0 [15:16:56] 10Operations, 10Traffic, 10User-Johan: Provide a multi-language user-faced warning regarding AES128-SHA deprecation - https://phabricator.wikimedia.org/T196371#4275890 (10Johan) Translations are being collected at https://meta.wikimedia.org/wiki/User:Johan_(WMF)/AES128-SHA [15:19:39] (03CR) 10Dzahn: [C: 032] DNS: Add mgmt DNS entries for bast2002 (supposed to be in public VLAN) [dns] - 10https://gerrit.wikimedia.org/r/439786 (https://phabricator.wikimedia.org/T196665) (owner: 10Papaul) [15:19:42] (03PS1) 10Paladox: Add gerrit.wmfusercontent.org to common/cache/misc.yaml [puppet] - 10https://gerrit.wikimedia.org/r/439939 [15:20:27] (03CR) 10Alexandros Kosiaris: [C: 032] Mask the default uwsgi service for ores [puppet] - 10https://gerrit.wikimedia.org/r/437984 (owner: 10Muehlenhoff) [15:20:30] (03PS2) 10Alexandros Kosiaris: Mask the default uwsgi service for ores [puppet] - 10https://gerrit.wikimedia.org/r/437984 (owner: 10Muehlenhoff) [15:20:33] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Mask the default uwsgi service for ores [puppet] - 10https://gerrit.wikimedia.org/r/437984 (owner: 10Muehlenhoff) [15:20:48] (03PS2) 10Paladox: Add gerrit.wmfusercontent.org to common/cache/misc.yaml [puppet] - 10https://gerrit.wikimedia.org/r/439939 [15:21:12] 10Operations, 10Gerrit, 10Traffic, 10Patch-For-Review: Enable avatars in gerrit - https://phabricator.wikimedia.org/T191183#4275931 (10Paladox) [15:21:55] RECOVERY - Upload HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above 
the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=upload&var-status_type=5 [15:22:15] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [15:22:23] (03PS3) 10Paladox: Add gerrit.wmfusercontent.org to common/cache/misc.yaml [puppet] - 10https://gerrit.wikimedia.org/r/439939 (https://phabricator.wikimedia.org/T191183) [15:22:59] (03CR) 10Dzahn: [C: 04-1] "this is adding a new director called "gerrit" (which already exists). what you want instead is adding a new domain to the existing directo" [puppet] - 10https://gerrit.wikimedia.org/r/439939 (https://phabricator.wikimedia.org/T191183) (owner: 10Paladox) [15:23:42] (03PS4) 10Paladox: Add gerrit.wmfusercontent.org to common/cache/misc.yaml [puppet] - 10https://gerrit.wikimedia.org/r/439939 (https://phabricator.wikimedia.org/T191183) [15:24:19] (03PS2) 10BBlack: esams rebalance: move cp3043 from text to upload [puppet] - 10https://gerrit.wikimedia.org/r/439936 [15:25:58] (03PS1) 10Addshore: Enable FileImporter monolog channel in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/439941 (https://phabricator.wikimedia.org/T195370) [15:26:57] (03PS3) 10Herron: mailman: perform rbl checks on listinfo requests [puppet] - 10https://gerrit.wikimedia.org/r/439915 (https://phabricator.wikimedia.org/T196989) [15:27:48] (03PS4) 10Herron: mailman: perform rbl checks on listinfo requests [puppet] - 10https://gerrit.wikimedia.org/r/439915 (https://phabricator.wikimedia.org/T196989) [15:28:55] (03CR) 10Herron: [C: 032] mailman: perform rbl checks on listinfo requests [puppet] - 10https://gerrit.wikimedia.org/r/439915 (https://phabricator.wikimedia.org/T196989) (owner: 10Herron) 
[15:29:13] (03PS5) 10Dzahn: cache::misc: Add gerrit backend, gerrit.wmfusercontent.org [puppet] - 10https://gerrit.wikimedia.org/r/439939 (https://phabricator.wikimedia.org/T191183) (owner: 10Paladox) [15:29:36] !log cp3043 switching from text to upload shortly, downtimed in icinga for 2h - https://gerrit.wikimedia.org/r/c/operations/puppet/+/439936 [15:29:39] (03PS6) 10Dzahn: cache::misc: Add gerrit backend, gerrit.wmfusercontent.org [puppet] - 10https://gerrit.wikimedia.org/r/439939 (https://phabricator.wikimedia.org/T191183) (owner: 10Paladox) [15:29:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:29:48] (03PS1) 10Giuseppe Lavagetto: jobrunner: reduce the number of old runners [puppet] - 10https://gerrit.wikimedia.org/r/439943 (https://phabricator.wikimedia.org/T197003) [15:29:50] (03PS1) 10Giuseppe Lavagetto: jobrunner: reduce to one redis server per datacenter [puppet] - 10https://gerrit.wikimedia.org/r/439944 (https://phabricator.wikimedia.org/T197003) [15:29:55] (03CR) 10Paladox: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/439939 (https://phabricator.wikimedia.org/T191183) (owner: 10Paladox) [15:30:40] (03PS1) 10Giuseppe Lavagetto: Reduce the jobqueue redis to use just one server per dc [mediawiki-config] - 10https://gerrit.wikimedia.org/r/439945 (https://phabricator.wikimedia.org/T197003) [15:31:14] (03CR) 10Dzahn: [C: 032] "not affecting anything prod so far. gerrit itself isnt behind misc::web, this is for hosting avatars in the future" [puppet] - 10https://gerrit.wikimedia.org/r/439939 (https://phabricator.wikimedia.org/T191183) (owner: 10Paladox) [15:31:53] mutante thanks :) [15:33:48] (03CR) 10Giuseppe Lavagetto: [C: 032] "https://puppet-compiler.wmflabs.org/compiler02/11463/ 2 runners per job type are more than enough given the current traffic." 
[puppet] - 10https://gerrit.wikimedia.org/r/439943 (https://phabricator.wikimedia.org/T197003) (owner: 10Giuseppe Lavagetto) [15:33:55] (03PS2) 10Giuseppe Lavagetto: jobrunner: reduce the number of old runners [puppet] - 10https://gerrit.wikimedia.org/r/439943 (https://phabricator.wikimedia.org/T197003) [15:34:41] (03CR) 10Ema: [C: 031] esams rebalance: move cp3043 from text to upload [puppet] - 10https://gerrit.wikimedia.org/r/439936 (owner: 10BBlack) [15:37:05] PROBLEM - mailman list info on fermium is CRITICAL: HTTP CRITICAL: HTTP/1.1 429 Too Many Requests - string Wikimedia Mailing List not found on https://lists.wikimedia.org:443/mailman/listinfo/wikimedia-l - 298 bytes in 0.008 second response time [15:38:39] (03CR) 10Ema: [C: 031] Set eventstreams max_connections to 25 per varnish instance [puppet] - 10https://gerrit.wikimedia.org/r/439772 (https://phabricator.wikimedia.org/T196553) (owner: 10Ottomata) [15:38:57] (03PS5) 10Volans: Client CLI: drop lsb_release dependency [software/debmonitor] - 10https://gerrit.wikimedia.org/r/439910 (https://phabricator.wikimedia.org/T191300) [15:39:39] (03CR) 10Dzahn: [C: 032] "confirmed with racadm getsysinfo" [puppet] - 10https://gerrit.wikimedia.org/r/439792 (https://phabricator.wikimedia.org/T196665) (owner: 10Papaul) [15:39:48] (03PS2) 10Dzahn: DHCP: Add MAC address for bast2002 [puppet] - 10https://gerrit.wikimedia.org/r/439792 (https://phabricator.wikimedia.org/T196665) (owner: 10Papaul) [15:39:50] (03PS1) 10Herron: Revert "mailman: perform rbl checks on listinfo requests" [puppet] - 10https://gerrit.wikimedia.org/r/439948 [15:40:01] (03CR) 10jerkins-bot: [V: 04-1] Client CLI: drop lsb_release dependency [software/debmonitor] - 10https://gerrit.wikimedia.org/r/439910 (https://phabricator.wikimedia.org/T191300) (owner: 10Volans) [15:40:18] !log cp3034 - nevermind, doing different approach later in the day, still pooled in text for now! 
[15:40:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:40:53] (03CR) 10Herron: [C: 032] "This is seeming too aggressive in testing after deployment. reverting." [puppet] - 10https://gerrit.wikimedia.org/r/439915 (https://phabricator.wikimedia.org/T196989) (owner: 10Herron) [15:41:18] (03CR) 10Volans: "The only failure are the py27 tests as expected due to T196628" [software/debmonitor] - 10https://gerrit.wikimedia.org/r/439910 (https://phabricator.wikimedia.org/T191300) (owner: 10Volans) [15:41:48] (03CR) 10Herron: [C: 032] Revert "mailman: perform rbl checks on listinfo requests" [puppet] - 10https://gerrit.wikimedia.org/r/439948 (owner: 10Herron) [15:41:54] (03PS2) 10Herron: Revert "mailman: perform rbl checks on listinfo requests" [puppet] - 10https://gerrit.wikimedia.org/r/439948 [15:42:32] (03CR) 10Alexandros Kosiaris: [C: 04-1] Install LFS on scap targets (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/437719 (https://phabricator.wikimedia.org/T180627) (owner: 10Awight) [15:46:01] (03CR) 10Filippo Giunchedi: "Found in testing, e.g. my home ip address was being 403'd" [puppet] - 10https://gerrit.wikimedia.org/r/439948 (owner: 10Herron) [15:48:56] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team, 10Patch-For-Review: rack/setup/install labstore1008 & labstore1009 - https://phabricator.wikimedia.org/T193655#4276003 (10chasemp) a:05Cmjohnson>03Bstorm [15:51:27] (03CR) 10Dzahn: [C: 032] DNS: Add mgmt & production DNS entries for lvs200[7-10] [dns] - 10https://gerrit.wikimedia.org/r/439803 (https://phabricator.wikimedia.org/T196560) (owner: 10Papaul) [15:51:54] 10Operations, 10JADE, 10Scoring-platform-team, 10User-Joe: Scalability concerns creating a page per revision - https://phabricator.wikimedia.org/T196547#4276013 (10Halfak) I don't think we should be designing for the worst-case scenario here. There are many situations where content creation patterns are c... 
[15:51:56] 10Operations, 10Mail, 10Wikimedia-Mailing-lists, 10Patch-For-Review: mailman listing unresponsive (fermium high latency) - https://phabricator.wikimedia.org/T196989#4276012 (10herron) I'm still able to generate noticeable load by hitting listinfo repeatedly within the 50req/5 min rate limit, so we might be... [15:52:18] (03CR) 10Dzahn: [C: 032] "looks right. needs manual rebase. doing that" [dns] - 10https://gerrit.wikimedia.org/r/439803 (https://phabricator.wikimedia.org/T196560) (owner: 10Papaul) [15:55:28] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team, 10Patch-For-Review: rack/setup/install labstore1008 & labstore1009 - https://phabricator.wikimedia.org/T193655#4276025 (10chasemp) I think this is ready for OS install and such? I spoke with @bstorm who is going to take this on and may need... [15:55:53] (03PS2) 10Dzahn: DNS: Add mgmt & production DNS entries for lvs200[7-10] [dns] - 10https://gerrit.wikimedia.org/r/439803 (https://phabricator.wikimedia.org/T196560) (owner: 10Papaul) [15:57:18] (03CR) 10Dzahn: [C: 032] DNS: Add mgmt & production DNS entries for lvs200[7-10] [dns] - 10https://gerrit.wikimedia.org/r/439803 (https://phabricator.wikimedia.org/T196560) (owner: 10Papaul) [15:58:05] (03CR) 10Muehlenhoff: [C: 031] "Looks good und unblocks Python 3 packages :-)" [software/debmonitor] - 10https://gerrit.wikimedia.org/r/439910 (https://phabricator.wikimedia.org/T191300) (owner: 10Volans) [15:59:26] (03PS7) 10Paladox: Gerrit: Add support for adding additional domains to alias in apache [puppet] - 10https://gerrit.wikimedia.org/r/439783 [15:59:40] (03PS8) 10Paladox: Gerrit: Add support for adding additional domains to alias in apache [puppet] - 10https://gerrit.wikimedia.org/r/439783 [15:59:43] (03CR) 10Volans: [V: 032 C: 032] Client CLI: drop lsb_release dependency [software/debmonitor] - 10https://gerrit.wikimedia.org/r/439910 (https://phabricator.wikimedia.org/T191300) (owner: 10Volans) [16:00:00] (03CR) 10Ottomata: 
"https://puppet-compiler.wmflabs.org/compiler02/11450/" [puppet] - 10https://gerrit.wikimedia.org/r/439772 (https://phabricator.wikimedia.org/T196553) (owner: 10Ottomata) [16:00:04] godog, moritzm, and _joe_: #bothumor My software never has bugs. It just develops random features. Rise for Puppet SWAT(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180612T1600). [16:00:04] No GERRIT patches in the queue for this window AFAICS. [16:00:04] (03PS2) 10Ottomata: Set eventstreams max_connections to 25 per varnish instance [puppet] - 10https://gerrit.wikimedia.org/r/439772 (https://phabricator.wikimedia.org/T196553) [16:00:09] (03CR) 10Ottomata: [V: 032 C: 032] Set eventstreams max_connections to 25 per varnish instance [puppet] - 10https://gerrit.wikimedia.org/r/439772 (https://phabricator.wikimedia.org/T196553) (owner: 10Ottomata) [16:00:28] (03CR) 10jerkins-bot: [V: 04-1] Gerrit: Add support for adding additional domains to alias in apache [puppet] - 10https://gerrit.wikimedia.org/r/439783 (owner: 10Paladox) [16:02:15] (03PS9) 10Paladox: Gerrit: Add support for adding additional domains to alias in apache [puppet] - 10https://gerrit.wikimedia.org/r/439783 [16:02:31] jynus: is this a duplicate thing? 
https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/437382/ [16:02:47] (03PS10) 10Paladox: Gerrit: Add support for adding additional domains to alias in apache [puppet] - 10https://gerrit.wikimedia.org/r/439783 [16:03:31] no, that sould be kept [16:03:34] I already fixed that [16:03:41] by moving the contents [16:03:41] (03CR) 10jerkins-bot: [V: 04-1] Gerrit: Add support for adding additional domains to alias in apache [puppet] - 10https://gerrit.wikimedia.org/r/439783 (owner: 10Paladox) [16:03:50] (03PS12) 10Muehlenhoff: Add initial Debianisation of debmonitor-client [software/debmonitor] - 10https://gerrit.wikimedia.org/r/438018 (https://phabricator.wikimedia.org/T191298) [16:05:30] (03CR) 10jerkins-bot: [V: 04-1] Add initial Debianisation of debmonitor-client [software/debmonitor] - 10https://gerrit.wikimedia.org/r/438018 (https://phabricator.wikimedia.org/T191298) (owner: 10Muehlenhoff) [16:05:33] (03CR) 10Jcrespo: [C: 04-1] "I already fixed by moving the unrelated bits elsewhere, but mariadb maintenance for mediawiki should be kept there. 
It currently is empty," [puppet] - 10https://gerrit.wikimedia.org/r/437382 (owner: 10Dzahn) [16:05:35] PROBLEM - puppet last run on cp1051 is CRITICAL: CRITICAL: Puppet last ran 9 hours ago [16:06:05] 10Operations, 10Discovery, 10Icinga, 10Maps, and 2 others: Create Icinga alert when OSM replication lags on maps - https://phabricator.wikimedia.org/T167549#4276089 (10Gehel) 05Open>03Resolved [16:06:58] (03PS11) 10Paladox: Gerrit: Add support for adding additional domains to alias in apache [puppet] - 10https://gerrit.wikimedia.org/r/439783 [16:07:01] 10Operations, 10Maps-Sprint, 10Patch-For-Review: reimage maps-test2004 to stretch and cassandra 2.2 - https://phabricator.wikimedia.org/T195741#4276128 (10Gehel) 05Open>03Resolved [16:07:03] (03PS1) 10Cmjohnson: Snapshot1009, adding dhcpd and netboot [puppet] - 10https://gerrit.wikimedia.org/r/439958 (https://phabricator.wikimedia.org/T196189) [16:07:33] (03PS1) 10Volans: Updated src to v0.1.3 [software/debmonitor/deploy] - 10https://gerrit.wikimedia.org/r/439959 (https://phabricator.wikimedia.org/T191300) [16:07:35] (03PS1) 10Volans: Built wheels for v0.1.2 [software/debmonitor/deploy] - 10https://gerrit.wikimedia.org/r/439960 (https://phabricator.wikimedia.org/T191300) [16:09:06] (03CR) 10Volans: [V: 032 C: 032] Updated src to v0.1.3 [software/debmonitor/deploy] - 10https://gerrit.wikimedia.org/r/439959 (https://phabricator.wikimedia.org/T191300) (owner: 10Volans) [16:09:23] (03CR) 10ArielGlenn: [C: 031] Snapshot1009, adding dhcpd and netboot [puppet] - 10https://gerrit.wikimedia.org/r/439958 (https://phabricator.wikimedia.org/T196189) (owner: 10Cmjohnson) [16:09:47] (03PS2) 10Cmjohnson: Snapshot1009, adding dhcpd and netboot [puppet] - 10https://gerrit.wikimedia.org/r/439958 (https://phabricator.wikimedia.org/T196189) [16:09:54] (03PS2) 10Volans: Built wheels for v0.1.3 [software/debmonitor/deploy] - 10https://gerrit.wikimedia.org/r/439960 (https://phabricator.wikimedia.org/T191300) [16:10:11] (03CR) 10Volans: [V: 
032 C: 032] Built wheels for v0.1.3 [software/debmonitor/deploy] - 10https://gerrit.wikimedia.org/r/439960 (https://phabricator.wikimedia.org/T191300) (owner: 10Volans) [16:10:25] RECOVERY - puppet last run on cp1051 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [16:10:55] (03PS12) 10Paladox: Gerrit: Add support for avatars url in apache [puppet] - 10https://gerrit.wikimedia.org/r/439783 [16:11:02] (03CR) 10Cmjohnson: [C: 032] Snapshot1009, adding dhcpd and netboot [puppet] - 10https://gerrit.wikimedia.org/r/439958 (https://phabricator.wikimedia.org/T196189) (owner: 10Cmjohnson) [16:11:11] !log volans@deploy1001 Started deploy [debmonitor/deploy@0eca14a]: Release v0.1.3 [16:11:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:11:33] !log volans@deploy1001 Finished deploy [debmonitor/deploy@0eca14a]: Release v0.1.3 (duration: 00m 22s) [16:11:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:12:47] (03PS1) 10Jcrespo: mariadb mediawiki maintenance: Update comment [puppet] - 10https://gerrit.wikimedia.org/r/439961 [16:13:47] 10Operations, 10Performance-Team, 10monitoring, 10Patch-For-Review: Consolidate performance website and related software - https://phabricator.wikimedia.org/T158837#4276219 (10Imarlier) [16:16:15] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS1299/IPv4: Active [16:23:22] 10Operations, 10Mail, 10Wikimedia-Mailing-lists, 10Patch-For-Review: mailman listing unresponsive (fermium high latency) - https://phabricator.wikimedia.org/T196989#4276270 (10fgiunchedi) A bigger nail in the coffin for GET requests is also going to be enabling caching by apache, at least for `listinfo` th... 
[16:25:55] PROBLEM - proton endpoints health on proton1002 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Respond file not found for a nonexistent title) is CRITICAL: Test Respond file not found for a nonexistent title returned the unexpected status 503 (expecting: 404) [16:27:31] (03PS13) 10Paladox: Gerrit: Add support for avatars url in apache [puppet] - 10https://gerrit.wikimedia.org/r/439783 [16:29:15] RECOVERY - proton endpoints health on proton1002 is OK: All endpoints are healthy [16:30:09] (03CR) 10Paladox: "Puppet compiler results https://puppet-compiler.wmflabs.org/compiler02/11467/" [puppet] - 10https://gerrit.wikimedia.org/r/439783 (owner: 10Paladox) [16:30:13] (03PS13) 10Muehlenhoff: Add initial Debianisation of debmonitor-client [software/debmonitor] - 10https://gerrit.wikimedia.org/r/438018 (https://phabricator.wikimedia.org/T191298) [16:31:16] (03PS1) 10Papaul: DNS: Add production DNS entries for bast2002 [dns] - 10https://gerrit.wikimedia.org/r/439965 (https://phabricator.wikimedia.org/T196665) [16:31:37] (03CR) 10jerkins-bot: [V: 04-1] Add initial Debianisation of debmonitor-client [software/debmonitor] - 10https://gerrit.wikimedia.org/r/438018 (https://phabricator.wikimedia.org/T191298) (owner: 10Muehlenhoff) [16:39:25] PROBLEM - proton endpoints health on proton1002 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Respond file not found for a nonexistent title) timed out before a response was received [16:40:25] RECOVERY - proton endpoints health on proton1002 is OK: All endpoints are healthy [16:42:16] (03PS3) 10BBlack: esams rebalance: add 3043 to upload [puppet] - 10https://gerrit.wikimedia.org/r/439936 [16:42:18] (03PS1) 10BBlack: esams rebalance: remove 3043 from text [puppet] - 10https://gerrit.wikimedia.org/r/439967 [16:42:30] (03PS14) 10Muehlenhoff: Add initial Debianisation of debmonitor-client [software/debmonitor] - 10https://gerrit.wikimedia.org/r/438018 (https://phabricator.wikimedia.org/T191298) [16:43:23] (03CR) 
10Paladox: "New date is friday as no one will be around on monday." [puppet] - 10https://gerrit.wikimedia.org/r/439444 (https://phabricator.wikimedia.org/T196812) (owner: 10Paladox) [16:44:00] (03CR) 10jerkins-bot: [V: 04-1] Add initial Debianisation of debmonitor-client [software/debmonitor] - 10https://gerrit.wikimedia.org/r/438018 (https://phabricator.wikimedia.org/T191298) (owner: 10Muehlenhoff) [16:44:18] 10Operations, 10ops-eqiad, 10Datasets-General-or-Unknown, 10Patch-For-Review, 10User-ArielGlenn: rack/setup/install snapshot1009 - https://phabricator.wikimedia.org/T196189#4276353 (10Cmjohnson) [16:44:26] 10Operations, 10ops-eqiad, 10Datasets-General-or-Unknown, 10Patch-For-Review, 10User-ArielGlenn: rack/setup/install snapshot1009 - https://phabricator.wikimedia.org/T196189#4249646 (10Cmjohnson) 05Open>03Resolved [16:45:57] 10Operations, 10Cloud-VPS, 10cloud-services-team (Kanban): 2018-01-02: labstore Tools and Misc share very full - https://phabricator.wikimedia.org/T183920#4276365 (10Bstorm) [16:46:00] 10Operations, 10Cloud-VPS, 10cloud-services-team: templatetiger is using 827G of 8T available tools nfs storage - https://phabricator.wikimedia.org/T183954#4276364 (10Bstorm) 05Open>03Resolved [16:49:35] (03PS1) 10ArielGlenn: add snapshot1009 as dumps testbed [puppet] - 10https://gerrit.wikimedia.org/r/439970 [16:49:56] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team, 10Patch-For-Review: rack/setup/install labstore1008 & labstore1009 - https://phabricator.wikimedia.org/T193655#4276376 (10Cmjohnson) Still need add mac address to the dhcp file and the netboot.cfg. I just enabled the switch ports so once the... 
[16:51:48] (03PS1) 10Papaul: DNS: Add mgmt DNS entries for dns200[1-2] [dns] - 10https://gerrit.wikimedia.org/r/439973 (https://phabricator.wikimedia.org/T196493) [16:52:05] (03PS1) 10Muehlenhoff: Add initial Debianisation of debmonitor-client [software/debmonitor] - 10https://gerrit.wikimedia.org/r/439974 (https://phabricator.wikimedia.org/T191298) [16:53:08] (03CR) 10jerkins-bot: [V: 04-1] Add initial Debianisation of debmonitor-client [software/debmonitor] - 10https://gerrit.wikimedia.org/r/439974 (https://phabricator.wikimedia.org/T191298) (owner: 10Muehlenhoff) [16:53:18] (03CR) 10ArielGlenn: [C: 032] add snapshot1009 as dumps testbed [puppet] - 10https://gerrit.wikimedia.org/r/439970 (owner: 10ArielGlenn) [16:54:13] !log starting branch cut for 1.32.0-wmf.8 [16:54:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:55:46] (03PS2) 10Muehlenhoff: Add initial Debianisation of debmonitor-client [software/debmonitor] - 10https://gerrit.wikimedia.org/r/439974 (https://phabricator.wikimedia.org/T191298) [16:56:12] (03PS1) 10ArielGlenn: add snapshot1009 to dumps scap targets [dumps/scap] - 10https://gerrit.wikimedia.org/r/439975 [16:56:36] if snapshot1009 whines it's being installed, please ignore [16:57:00] (03CR) 10jerkins-bot: [V: 04-1] Add initial Debianisation of debmonitor-client [software/debmonitor] - 10https://gerrit.wikimedia.org/r/439974 (https://phabricator.wikimedia.org/T191298) (owner: 10Muehlenhoff) [16:57:42] (03CR) 10ArielGlenn: [V: 032 C: 032] add snapshot1009 to dumps scap targets [dumps/scap] - 10https://gerrit.wikimedia.org/r/439975 (owner: 10ArielGlenn) [16:58:44] 10Operations, 10ops-codfw, 10DNS, 10Traffic, 10Patch-For-Review: rack/setup/install dns200[12].wikimedia.org - https://phabricator.wikimedia.org/T196493#4276412 (10Papaul) [17:00:04] cscott, arlolra, subbu, halfak, and Amir1: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Services – Graphoid / Parsoid / Citoid / ORES deploy. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180612T1700). [17:01:41] 10Operations, 10Cassandra, 10Discovery, 10Maps, 10Patch-For-Review: cassandra 2.2.6-wmf4 is not compatible with python 2.7.13 (debian stretch) - https://phabricator.wikimedia.org/T196044#4276434 (10Gehel) @Eevans what do we need to do before uploading this to reprepro? I assume some coordination with @el... [17:02:47] (03Abandoned) 10Muehlenhoff: Add initial Debianisation of debmonitor-client [software/debmonitor] - 10https://gerrit.wikimedia.org/r/439974 (https://phabricator.wikimedia.org/T191298) (owner: 10Muehlenhoff) [17:05:40] (03PS8) 10Jcrespo: [WIP] Add replication managing [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/439871 [17:07:21] 10Operations, 10Cassandra, 10Discovery, 10Maps, 10Patch-For-Review: cassandra 2.2.6-wmf4 is not compatible with python 2.7.13 (debian stretch) - https://phabricator.wikimedia.org/T196044#4276459 (10Eevans) >>! In T196044#4276434, @Gehel wrote: > @Eevans what do we need to do before uploading this to repr... [17:07:24] PROBLEM - nutcracker process on snapshot1009 is CRITICAL: NRPE: Command check_nutcracker not defined [17:07:44] PROBLEM - Check systemd state on snapshot1009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [17:07:53] PROBLEM - Check whether ferm is active by checking the default input chain on snapshot1009 is CRITICAL: NRPE: Command check_ferm_active not defined [17:07:53] PROBLEM - nutcracker port on snapshot1009 is CRITICAL: NRPE: Command check_nutcracker_port not defined [17:09:53] RECOVERY - Check whether ferm is active by checking the default input chain on snapshot1009 is OK: OK ferm input default policy is set [17:11:46] (03CR) 10Paladox: "Delayed until after the sre offsite." 
[puppet] - 10https://gerrit.wikimedia.org/r/439444 (https://phabricator.wikimedia.org/T196812) (owner: 10Paladox) [17:14:04] PROBLEM - puppet last run on snapshot1009 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 1 minute ago with 3 failures. Failed resources (up to 3 shown): Package[lilypond],Package[php-luasandbox],Package[dumps/dumps] [17:16:55] 10Operations, 10ops-eqiad: Degraded RAID on ms-be1034 - https://phabricator.wikimedia.org/T195569#4276487 (10Cmjohnson) 05Open>03Resolved Thanks! [17:19:50] 10Operations, 10ops-eqiad, 10DC-Ops: Power supply issue on maps1002 - https://phabricator.wikimedia.org/T196897#4276498 (10Cmjohnson) Your case was successfully submitted. Please note your Case ID: 5330129651 for future reference. [17:25:49] yeah we know about the puppet thing, ignore please [17:27:26] 10Operations, 10ops-eqiad, 10DBA: Bad disk on db1065 - https://phabricator.wikimedia.org/T196806#4276512 (10Marostegui) 05Open>03Resolved The new disk worked fine, thanks!! ``` root@db1065:~# megacli -LDPDInfo -aAll Adapter #0 Number of Virtual Disks: 1 Virtual Drive: 0 (Target Id: 0) Name... [17:37:05] !log ariel@deploy1001 Started deploy [dumps/dumps@038c8b3]: sync after snapshot1009 install [17:37:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:37:12] !log ariel@deploy1001 Finished deploy [dumps/dumps@038c8b3]: sync after snapshot1009 install (duration: 00m 07s) [17:37:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:37:56] !log ariel@deploy1001 Started deploy [dumps/dumps@038c8b3]: sync after snapshot1009 install [17:37:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:38:01] !log ariel@deploy1001 Finished deploy [dumps/dumps@038c8b3]: sync after snapshot1009 install (duration: 00m 04s) [17:38:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:39:35] almost there... 
one reboot to go [17:40:33] PROBLEM - Host snapshot1009 is DOWN: PING CRITICAL - Packet loss = 100% [17:41:17] it's rebooting.... [17:41:33] RECOVERY - nutcracker port on snapshot1009 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212 [17:41:43] RECOVERY - Host snapshot1009 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms [17:42:11] (03PS1) 10Dduvall: Group0 to 1.32.0-wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/439987 [17:42:14] RECOVERY - nutcracker process on snapshot1009 is OK: PROCS OK: 1 process with UID = 113 (nutcracker), command name nutcracker [17:42:33] RECOVERY - Check systemd state on snapshot1009 is OK: OK - running: The system is fully operational [17:44:34] RECOVERY - puppet last run on snapshot1009 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [17:48:33] RECOVERY - Check systemd state on kubernetes2003 is OK: OK - running: The system is fully operational [17:51:43] PROBLEM - Check systemd state on kubernetes2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[17:53:18] (03PS1) 10ArielGlenn: get snapshot1001 ready for decommissioning [puppet] - 10https://gerrit.wikimedia.org/r/439991 [17:57:28] twentyafterfour: https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/439778/ [17:58:10] (03CR) 10ArielGlenn: [C: 032] get snapshot1001 ready for decommissioning [puppet] - 10https://gerrit.wikimedia.org/r/439991 (owner: 10ArielGlenn) [18:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180612T1800) [18:03:03] !log dduvall@deploy1001 Started scap: testwiki to php-1.32.0-wmf.8 and rebuild l10n cache [18:03:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:03:45] 10Operations, 10Citoid, 10Code-Stewardship-Reviews, 10VisualEditor, 10Services (watching): zotero translation server: code stewardship request - https://phabricator.wikimedia.org/T187194#4276578 (10Jrbranaa) Added entry to developers/maintainers page. Please augment with more accurate description and li... [18:06:30] 10Operations, 10Citoid, 10Code-Stewardship-Reviews, 10VisualEditor, 10Services (watching): zotero translation server: code stewardship request - https://phabricator.wikimedia.org/T187194#4276587 (10Jrbranaa) >>! In T187194#4256587, @faidon wrote: > So we need to do //something// in a very short amount of... [18:08:13] PROBLEM - proton endpoints health on proton2001 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Respond file not found for a nonexistent title) is CRITICAL: Test Respond file not found for a nonexistent title returned the unexpected status 503 (expecting: 404) [18:09:13] RECOVERY - proton endpoints health on proton2001 is OK: All endpoints are healthy [18:12:31] AaronSchulz: does that need to go out with the train? 
[18:12:55] that = https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/439778/ [18:13:49] 10Operations, 10ops-eqiad, 10decommission, 10User-ArielGlenn: decommission snapshot1001 - https://phabricator.wikimedia.org/T197021#4276597 (10ArielGlenn) p:05Triage>03Normal [18:16:24] RECOVERY - Check systemd state on db1068 is OK: OK - running: The system is fully operational [18:19:13] marxarelli: would be nice (for T194403). It's not new to wmf8 though. [18:19:13] T194403: Wikimedia\Rdbms\ChronologyProtector::initPositions: expected but failed to find position index. - https://phabricator.wikimedia.org/T194403 [18:21:35] AaronSchulz: kk. if you can get it reviewed/merged, i'll cherry-pick it to 1.32.0-wmf.8 and make sure it gets deployed [18:21:53] i'm chilling until the deploy window, so you have some time [18:29:43] AaronSchulz sorry if it looked like I was pressing you to do something, I wasn't [18:30:16] lately I am trying to be clear about ongoing errors to avoid missunderstandings [18:30:44] if the answer is "not a huge deal, will do at other time", it is ok too [18:35:49] I was trying to backport anyway :) [18:36:47] I would like to talk to you about roadmap of architecture- I think some things we do now will not work on multi-dc [18:36:59] (not now, but soon-ish) [18:40:06] (03PS1) 10Herron: mailman: whitelist icinga hosts from rate limiting [puppet] - 10https://gerrit.wikimedia.org/r/439995 (https://phabricator.wikimedia.org/T196989) [18:40:50] (03PS1) 10Ottomata: Use Kafka main-eqiad for EventStreams service [puppet] - 10https://gerrit.wikimedia.org/r/439996 (https://phabricator.wikimedia.org/T185225) [18:41:25] (03CR) 10Herron: [C: 032] mailman: whitelist icinga hosts from rate limiting [puppet] - 10https://gerrit.wikimedia.org/r/439995 (https://phabricator.wikimedia.org/T196989) (owner: 10Herron) [18:42:43] !log dduvall@deploy1001 Finished scap: testwiki to php-1.32.0-wmf.8 and rebuild l10n cache (duration: 39m 39s) [18:42:47] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:43:12] PROBLEM - Check systemd state on restbase-dev1006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [18:56:18] RECOVERY - mailman list info on fermium is OK: HTTP OK: HTTP/1.1 200 OK - 15500 bytes in 0.152 second response time [18:57:02] !log restarted icinga service on einsteinium [18:57:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:00:04] marxarelli: That opportune time is upon us again. Time for a MediaWiki train deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180612T1900). [19:00:59] weeee [19:03:01] AaronSchulz: any update on https://gerrit.wikimedia.org/r/c/mediawiki/core/+/439778/ ? train is leaving the station soon [19:03:17] RECOVERY - Check systemd state on restbase-dev1006 is OK: OK - running: The system is fully operational [19:13:31] (03CR) 10Dduvall: [C: 032] Group0 to 1.32.0-wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/439987 (owner: 10Dduvall) [19:15:03] (03Merged) 10jenkins-bot: Group0 to 1.32.0-wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/439987 (owner: 10Dduvall) [19:16:53] !log dduvall@deploy1001 rebuilt and synchronized wikiversions files: group0 to 1.32.0-wmf.8 [19:16:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:17:59] (03CR) 10jenkins-bot: Group0 to 1.32.0-wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/439987 (owner: 10Dduvall) [19:19:13] (03PS1) 10Urbanecm: Allow bcts on private&fishbowl wikis advanced privilege manipulation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440000 (https://phabricator.wikimedia.org/T197024) [19:19:15] (03PS1) 10Urbanecm: Clean legacy AddGroups/RemoveGroups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440001 (https://phabricator.wikimedia.org/T197024) [19:20:14] (03CR) 10Zoranzoki21: [C: 031] "440k :)" [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/440000 (https://phabricator.wikimedia.org/T197024) (owner: 10Urbanecm) [19:20:56] (03PS6) 10Zoranzoki21: Add sites to the wgCopyUploadsDomains whitelist of Wikimedia Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436211 (https://phabricator.wikimedia.org/T195270) [19:25:27] (03PS1) 10Urbanecm: Some wikis bureacurats are able to grant non-grantable groups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440002 (https://phabricator.wikimedia.org/T197026) [19:29:57] PROBLEM - proton endpoints health on proton2002 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) is CRITICAL: Test Print the Foo page from en.wp.org in letter format returned the unexpected status 503 (expecting: 200): /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) is CRITICAL: Test Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices returned the unexpected status 503 (expecting: 200) [19:30:57] RECOVERY - proton endpoints health on proton2002 is OK: All endpoints are healthy [19:47:21] wow, the loading global options creates quite a lot of log messages in logstash [19:49:11] (03PS1) 10Ottomata: Add kafka_mirror_maker cert [labs/private] - 10https://gerrit.wikimedia.org/r/440008 [19:49:31] (03CR) 10Ottomata: [V: 032 C: 032] Add kafka_mirror_maker cert [labs/private] - 10https://gerrit.wikimedia.org/r/440008 (owner: 10Ottomata) [19:54:09] (03PS1) 10Urbanecm: Make ProofreadPage operate on correct namespaces in pmswikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440009 (https://phabricator.wikimedia.org/T197033) [19:57:41] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics servers for mepps - https://phabricator.wikimedia.org/T192472#4276984 (10mepps) Thank you @Dzahn!
I'm currently trying to log into JupyterHub and my wikitech credentials aren't working. I just wanted to make sure I was ad... [20:17:44] What should be done next, so ORES can be enabled on srwiki? [20:28:04] (03PS2) 10Herron: adds jforrester to deployment, deploy-service, & mobileapps-admin groups [puppet] - 10https://gerrit.wikimedia.org/r/437819 (https://phabricator.wikimedia.org/T196566) (owner: 10RobH) [20:29:02] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review, 10Reading-Infrastructure-Team-Backlog (Kanban): Requesting deployment access for jforrester - https://phabricator.wikimedia.org/T196566#4277046 (10herron) Thanks! Moving forward with the patch now. [20:29:19] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review, 10Reading-Infrastructure-Team-Backlog (Kanban): Requesting deployment access for jforrester - https://phabricator.wikimedia.org/T196566#4277050 (10herron) [20:29:48] (03CR) 10Herron: [C: 032] adds jforrester to deployment, deploy-service, & mobileapps-admin groups [puppet] - 10https://gerrit.wikimedia.org/r/437819 (https://phabricator.wikimedia.org/T196566) (owner: 10RobH) [20:43:48] PROBLEM - Check Varnish expiry mailbox lag on cp3046 is CRITICAL: CRITICAL: expiry mailbox lag is 2032720 [20:51:32] (03PS1) 10Ottomata: Regenerate all certificates that were signed by the now decommed puppetmaster02 [labs/private] - 10https://gerrit.wikimedia.org/r/440016 (https://phabricator.wikimedia.org/T195686) [20:51:58] (03CR) 10Ottomata: [V: 032 C: 032] Regenerate all certificates that were signed by the now decommed puppetmaster02 [labs/private] - 10https://gerrit.wikimedia.org/r/440016 (https://phabricator.wikimedia.org/T195686) (owner: 10Ottomata) [20:52:06] (03CR) 10MarcoAurelio: "I proposed this in the past and the question was 'did they asked for it?'. Well, on one hand I do not oppose this change. 
On the other, I " (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440000 (https://phabricator.wikimedia.org/T197024) (owner: 10Urbanecm) [20:54:17] PROBLEM - proton endpoints health on proton1002 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received [20:55:17] RECOVERY - proton endpoints health on proton1002 is OK: All endpoints are healthy [20:55:45] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review, 10Reading-Infrastructure-Team-Backlog (Kanban): Requesting deployment access for jforrester - https://phabricator.wikimedia.org/T196566#4277168 (10herron) 05Open>03Resolved a:03herron Access has been provisioned @Jdforrester-WMF ``` deploy1001... [20:55:50] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review, 10Reading-Infrastructure-Team-Backlog (Kanban): Requesting deployment access for jforrester - https://phabricator.wikimedia.org/T196566#4277171 (10herron) [20:58:38] marxarelli: no CR yet [20:59:12] (03CR) 10MarcoAurelio: Some wikis bureacurats are able to grant non-grantable groups (037 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440002 (https://phabricator.wikimedia.org/T197026) (owner: 10Urbanecm) [21:03:58] PROBLEM - Check Varnish expiry mailbox lag on cp3046 is CRITICAL: CRITICAL: expiry mailbox lag is 2142389 [21:15:55] (03CR) 10Jon Harald Søby: [C: 031] Fix wrong language in ur.wiktionary namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/437974 (owner: 10Urbanecm) [21:18:07] 10Operations, 10ops-eqiad: ms-be1036 in power off status, not responsive to power on commands - https://phabricator.wikimedia.org/T196873#4277197 (10Cmjohnson) A new system board is required. I will coordinate with HP to get this taken care of ASAP. Required part is 775400-001 System I/O board (motherb... 
[21:31:56] (03CR) 10Imarlier: "Puppet compiler run looks right: https://puppet-compiler.wmflabs.org/compiler02/11468/" [puppet] - 10https://gerrit.wikimedia.org/r/439648 (https://phabricator.wikimedia.org/T158837) (owner: 10Imarlier) [21:32:39] Anyone available to take a quick look and then merge https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/439648/ ? Literally a one line change... :-) [21:32:48] PROBLEM - proton endpoints health on proton1002 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Respond file not found for a nonexistent title) is CRITICAL: Test Respond file not found for a nonexistent title returned the unexpected status 503 (expecting: 404) [21:33:57] RECOVERY - proton endpoints health on proton1002 is OK: All endpoints are healthy [21:39:28] (03PS2) 10BBlack: esams rebalance: remove 3043 from text [puppet] - 10https://gerrit.wikimedia.org/r/439967 [21:39:30] (03PS4) 10BBlack: esams rebalance: add 3043 to upload [puppet] - 10https://gerrit.wikimedia.org/r/439936 [21:39:41] !log cp3043 - starting process to move to reimage into cache_upload [21:39:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:40:05] !log bblack@neodymium conftool action : set/pooled=no; selector: name=cp3043.esams.wmnet [21:40:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:42:29] (03CR) 10BBlack: [C: 032] esams rebalance: remove 3043 from text [puppet] - 10https://gerrit.wikimedia.org/r/439967 (owner: 10BBlack) [21:46:25] !log cp3046 - restart varnish backend for mbox lag [21:46:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:51:05] (03CR) 10Alex Monk: mediawiki::web::beta_sites: convert wikibooks to vhost (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/439894 (https://phabricator.wikimedia.org/T196968) (owner: 10Giuseppe 
Lavagetto) [21:54:28] RECOVERY - Check Varnish expiry mailbox lag on cp3046 is OK: OK: expiry mailbox lag is 0 [22:00:28] PROBLEM - HHVM rendering on mw2139 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:01:27] RECOVERY - HHVM rendering on mw2139 is OK: HTTP OK: HTTP/1.1 200 OK - 76173 bytes in 0.330 second response time [22:01:37] PROBLEM - proton endpoints health on proton1002 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received [22:02:38] RECOVERY - proton endpoints health on proton1002 is OK: All endpoints are healthy [22:05:48] PROBLEM - proton endpoints health on proton1002 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) is CRITICAL: Test Print the Foo page from en.wp.org in letter format returned the unexpected status 503 (expecting: 200): /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) is CRITICAL: Test Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices returned the unexpected status 503 (expecting: 200): /{domain}/v1/pdf/{title}/{format}/{type} (Respond file not found for a nonexistent title) is CRITICAL: Test Respond file not found for a nonexistent title returned the unexpected status 503 (expecting: 404) [22:06:41] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team, 10Patch-For-Review: rack/setup/install labstore1008 & labstore1009 - https://phabricator.wikimedia.org/T193655#4277285 (10Bstorm) Hmm. I'm coming up dry on how to find the MAC address in all the things here. labstore1008/9.mgmt.eqiad.wmnet...
[22:10:27] RECOVERY - proton endpoints health on proton1002 is OK: All endpoints are healthy [22:19:14] (03CR) 10BBlack: [C: 032] esams rebalance: add 3043 to upload [puppet] - 10https://gerrit.wikimedia.org/r/439936 (owner: 10BBlack) [22:19:28] 10Operations, 10DBA, 10Gerrit, 10Phabricator: Massive increase of writes in m3 section - https://phabricator.wikimedia.org/T196840#4277313 (10mmodell) @marostegui: I canceled some of the queued jobs which should have helped somewhat. The only thing I know to do beyond this is to stop replicating from gerrit. [22:22:37] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team, 10Patch-For-Review: rack/setup/install labstore1008 & labstore1009 - https://phabricator.wikimedia.org/T193655#4277317 (10Cmjohnson) Hrm, that's odd ....dns is setup and I setup idrac...I wondering if I forgot to connect the green mgmt cable.... [22:23:25] !log phabricator: taking phd offline to relieve the load on the m3 database cluster [22:23:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:24:58] !log phabricator: I scheduled a 24 hour downtime in icinga for the phd service, to give me time to work on this issue. See T196840 [22:25:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:25:03] T196840: Massive increase of writes in m3 section - https://phabricator.wikimedia.org/T196840 [22:27:46] (03PS3) 10Paladox: phabricator: Make phd.taskmasters configurable with hiera [puppet] - 10https://gerrit.wikimedia.org/r/439645 [22:33:07] 10Operations, 10ops-eqiad, 10DC-Ops: Power supply issue on maps1002 - https://phabricator.wikimedia.org/T196897#4277323 (10Cmjohnson) Dear Christopher Johnson, Hewlett Packard Enterprise Reference Number: 5330129651 STATUS: Customer Self Repair Part has been shipped Part/s shipped: 754377-001 Part descr... 
[22:33:25] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review, 10Reading-Infrastructure-Team-Backlog (Kanban): Requesting deployment access for jforrester - https://phabricator.wikimedia.org/T196566#4277329 (10Jdforrester-WMF) Thank you! Confirmed that I can log into deploy1001 in production now. [22:37:15] !log (from yesterday) resetting passwords for compromised accounts (T197046) [22:37:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:59:33] !log bblack@neodymium conftool action : set/pooled=yes; selector: name=cp3043.esams.wmnet [22:59:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:00:04] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: My dear minions, it's time we take the moon! Just kidding. Time for Evening SWAT (Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180612T2300). [23:00:04] Zoranzoki21: A patch you scheduled for Evening SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [23:00:45] !log cp3043 - done, reimaged, in live service for cache_upload [23:00:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:01:01] I am here :) [23:05:53] !log resetting passwords for compromised accounts (T197046) [23:05:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:13:31] Is anyone who can swat active right now? [23:13:38] 10Operations, 10Mail, 10Wikimedia-Mailing-lists, 10Patch-For-Review: mailman listing unresponsive (fermium high latency) - https://phabricator.wikimedia.org/T196989#4277513 (10herron) >>! In T196989#4276270, @fgiunchedi wrote: > A bigger nail in the coffin for GET requests is also going to be enabling cach... 
[23:15:04] 10Operations, 10DBA, 10Gerrit, 10Phabricator: Massive increase of writes in m3 section - https://phabricator.wikimedia.org/T196840#4277521 (10mmodell) I'm deleting queued jobs in batches of 100,000. I've also reduced the number of phabricator workers to 5 (from 10) so overall there should be a reduction in... [23:15:28] PROBLEM - proton endpoints health on proton2001 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) is CRITICAL: Test Print the Foo page from en.wp.org in letter format returned the unexpected status 503 (expecting: 200): /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received [23:16:37] RECOVERY - proton endpoints health on proton2001 is OK: All endpoints are healthy [23:21:07] PROBLEM - proton endpoints health on proton2001 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Respond file not found for a nonexistent title) timed out before a response was received [23:22:07] RECOVERY - proton endpoints health on proton2001 is OK: All endpoints are healthy [23:24:02] Is anyone SWATing? [23:24:35] I wait same [23:24:44] James_F: Hi, I wait same. I no know it [23:36:43] Zoranzoki21: I can SWAT now if you are still here. [23:36:47] I am here [23:36:50] Can you? [23:36:53] Great. Yup. [23:37:24] Niharika: https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/436211/ [23:38:39] James_F: Have you got a minute to look over the patch? It seems sane but I'm not sure if people need to approve/have consensus before those domains can be added. [23:39:07] Niharika: You talk for my or another patch? [23:39:10] Sure. [23:39:42] Zoranzoki21: Yours. [23:40:13] Niharika: Other than the whitespace issues, looks fine to me. [23:40:53] James_F: Which whitespace issues?
[23:42:04] (03PS7) 10Niharika29: Add sites to the wgCopyUploadsDomains whitelist of Wikimedia Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436211 (https://phabricator.wikimedia.org/T195270) (owner: 10Zoranzoki21) [23:42:11] Alright, fixed them. [23:42:24] Zoranzoki21: The comments were misaligned. [23:42:28] Thanks James_F. [23:42:37] (03CR) 10Niharika29: [C: 032] "SWAT." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436211 (https://phabricator.wikimedia.org/T195270) (owner: 10Zoranzoki21) [23:43:11] Niharika: Oh it.. Ok, thank you for fix and deploying [23:44:21] (03Merged) 10jenkins-bot: Add sites to the wgCopyUploadsDomains whitelist of Wikimedia Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436211 (https://phabricator.wikimedia.org/T195270) (owner: 10Zoranzoki21) [23:47:48] !log niharika29@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Add sites to the wgCopyUploadsDomains whitelist of Wikimedia Commons T195270, T195928 (duration: 00m 59s) [23:47:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:47:53] T195270: Please add and to the wgCopyUploadsDomains whitelist of Wikimedia Commons - https://phabricator.wikimedia.org/T195270 [23:47:53] T195928: Please add Chilean government websites to the wgCopyUploadsDomains whitelist of Wikimedia Commons - https://phabricator.wikimedia.org/T195928 [23:47:57] (03CR) 10jenkins-bot: Add sites to the wgCopyUploadsDomains whitelist of Wikimedia Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436211 (https://phabricator.wikimedia.org/T195270) (owner: 10Zoranzoki21) [23:48:32] Thank you very much! [23:49:02] Zoranzoki21: you're welcome. :) [23:49:40] Good night